Econometric Models: Lecture Notes for 2014/15
ESS/DES Econometrics (revised February 2015)
Giovanni Bruno
Department of Economics, Bocconi University, Milano
E-mail address: [email protected]
Contents
Part 1. Linear Models 7
Chapter 1. Introduction 8
1.1. Introduction 8
1.2. The linear population model 10
Chapter 2. The linear regression model 13
2.1. From the linear population model to the linear regression model 13
2.2. The properties of the LRM 14
2.3. Difficulties 15
Chapter 3. The Algebraic properties of OLS 18
3.1. Motivation, notation, conventions and main assumptions 18
3.2. Linear combinations of vectors 19
3.3. OLS: definition and properties 19
3.4. Spanning sets and orthogonal projections 27
3.5. OLS residuals and fitted values 30
3.6. Partitioned regression 31
3.7. Goodness of fit and the analysis of variance 37
3.8. Centered and uncentered goodness-of-fit measures 39
Chapter 4. The finite-sample statistical properties of OLS 43
4.1. Introduction 43
4.2. Unbiasedness 43
4.3. The Gauss-Markov Theorem 44
4.4. Estimating the covariance matrix of OLS 47
4.5. Exact tests of significance with normally distributed errors 48
4.6. The general law of iterated expectation 58
4.7. The omitted variable bias 59
4.8. The variance of an OLS individual coefficient 65
4.9. Residuals from partitioned OLS regressions 70
Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity 71
5.1. Introduction 71
5.2. Embedding the Oaxaca model into a pooled regression framework 71
5.3. The OLS estimator in the Oaxaca model is BLUE 75
5.4. A general result 77
Chapter 6. Tests for structural change 81
6.1. The Chow predictive test 81
6.2. An equivalent reformulation of the Chow predictive test 85
6.3. The classical Chow test 87
Chapter 7. Large sample results for OLS and GLS estimators 90
7.1. Introduction 90
7.2. OLS with non-spherical error covariance matrix 91
7.3. GLS 96
7.4. Large sample tests 103
Chapter 8. Fixed and Random Effects Panel Data Models 108
8.1. Introduction 108
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model) 108
8.3. The Random Effect Model 117
8.4. Stata implementation of standard panel data estimators 123
8.5. Testing fixed effects against random effects models 125
8.6. Large-sample results for the LSDV estimator 129
8.7. A Robust covariance estimator 132
8.8. Unbalanced panels 134
Chapter 9. Robust inference with cluster sampling 136
9.1. Introduction 136
9.2. Two-way clustering 137
9.3. Stata implementation 142
Chapter 10. Issues in linear IV and GMM estimation 144
10.1. Introduction 144
10.2. The method of moments 146
10.3. Stata implementation of the TSLS estimator 150
10.4. Stata implementation of the (linear) GMM estimator 151
10.5. Robust Variance Estimators 152
10.6. Durbin-Wu-Hausman Exogeneity test 152
10.7. Endogenous binary variables 155
10.8. Testing for weak instruments 156
10.9. Inference with weak instruments 156
10.10. Dynamic panel data 157
Part 2. Non-linear models 169
Chapter 11. Non-linear regression models 170
11.1. Introduction 170
11.2. Non-linear least squares 170
11.3. Poisson model for count data 171
11.4. Modelling and testing overdispersion 174
Chapter 12. Binary dependent variable models 176
12.1. Introduction 176
12.2. Binary models 177
12.3. Coefficient estimates and marginal effects 180
12.4. Tests and Goodness of fit measures 181
12.5. Endogenous regressors 182
12.6. Independent latent heterogeneity 182
12.7. Multivariate probit models 183
Chapter 13. Censored and selection models 187
13.1. Introduction 187
13.2. Tobit models 187
13.3. A simple selection model 190
Chapter 14. Quantile regression 192
14.1. Introduction 192
14.2. Properties of conditional quantiles 192
14.3. Estimation 193
14.4. A heteroskedastic regression model with simulated data 195
Bibliography 197
Part 1
Linear Models
CHAPTER 1
Introduction
1.1. Introduction
Causation is not the same thing as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,
(1.1.1) y = g(x, ε)
where g: R^(k+q) → R. Often, the structural relationship is formulated as a probabilistic model for a given population of interest. So y denotes a random scalar, x = (x1 ... xk)' ∈ 𝒳 ⊆ R^k is a k × 1 random vector of explanatory variables of interest and ε is a q × 1 random vector of latent variables. A structural model can be understood as one showing a causal relationship from the economic factors of interest, x, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, g(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; or it may be the demand function for some input in a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising technological latent heterogeneity, and so on.¹
The individual g(x, ε), with its gradient vector of marginal effects, ∂ₓ g(x, ε), and Hessian matrix, D_xx g(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function,
∫ g(x, ε) dF(ε),
the population-averaged marginal effects,
∫ ∂ₓ g(x, ε) dF(ε),
or the population-averaged Hessian matrix,
∫ D_xx g(x, ε) dF(ε).
Statistics supplements the probabilistic model with a sampling mechanism in order to estimate characteristics of the population from a sample of observables. The population objects of interest may be, for example, the joint probability distribution of the observables, F(y, x), and its moments, E(y), E(x), E(xx'), E(xy); the conditional distribution of y given x, F(y|x), and its moments, E(y|x) and Var(y|x); or functions of the above.
¹ Wooldridge (2010) prefers to think of g(x, ε) as a structural conditional expectation: E(y|x, ε) ≡ g(x, ε). There is nothing in the present analysis that prevents such an interpretation.
The key question is under what conditions these estimable statistical objects are informative about g(x, ε). Evidently, to establish a mapping between the structural economic objects of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in g(x, ε), and to do so in a plausible way. The restrictions used to this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.
1.2. The linear population model
Equation (1.1.1) is a linear model of the population if the following assumptions hold.
P.1: Linearity: g(x, ε) = x'β + ε, with ε being a random scalar (q = 1) and β a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k or, equivalently, det[E(xx')] ≠ 0.
P.3: Conditional-mean independence of ε and x: E(ε|x) = 0.
Under linearity (P.1), equation (1.1.1) becomes
(1.2.1) y = x'β + ε,
and then, given P.3, E(y|x) = x'β.
An equivalent, but easier to interpret, formulation of assumption P.2 is:
P.2b: No element of x in 𝒳 can be obtained as a linear combination of the others with probability equal to one: Pr(a'x = 0) = 1 only if a = 0.
The following proves the equivalence of P.2 and P.2b (not crucial and rather technical, it can be skipped for the exam). I exploit the properties of the expectation and rank operators. Assume P.2 and Pr(a'x = 0) = 1 for some conformable constant vector a. Then E(a'xx'a) = 0, and so a'E(xx')a = 0, which implies a = 0 by P.2, proving P.2b. Now, assume P.2b and pick any a ≠ 0. Then Pr(a'x = 0) ≠ 1 and so Pr(a'x ≠ 0) > 0. But since a'x ≠ 0 is equivalent to a'xx'a > 0, then Pr(a'xx'a > 0) = Pr(a'x ≠ 0) > 0. So, since Pr(a'xx'a ≥ 0) = 1, E(a'xx'a) > 0, which in turn implies a'E(xx')a > 0. Therefore, E(xx') is positive definite and so non-singular, which is P.2.
Exercise 1. Prove that if x = (1 x1)' then assumption P.2 is equivalent to Var(x1) ≠ 0.
Solution:
E(xx') =  ( 1        E(x1)  )
          ( E(x1)    E(x1²) )
and so det E(xx') = E(x1²) − E²(x1) = Var(x1), and the claim is proved by noting that for any k × k matrix A, rank(A) = k if and only if det(A) ≠ 0.
By equation (1.1.1) and assumption P.1, the latent part of g(x, ε), ε, satisfies the equation
ε = y − x'β
and the marginal effects, ∂ₓ g(x, ε), satisfy
∂ₓ g(x, ε) = β.
By assumption P.3 and the law of iterated expectations, E(xε) = 0. Since ε = y − x'β, we have the system of k moment conditions
(1.2.2) E(xy − xx'β) = 0
or E(xy) = E(xx')β. Assumption P.2, then, ensures that the foregoing system can be solved for β to give
(1.2.3) β = [E(xx')]⁻¹ E(xy).
At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, g(x, ε), ε and ∂ₓ g(x, ε), and, on the other, the observable or estimable objects y, x, E(xx') and E(xy). Indeed, g(x, ε), ε and ∂ₓ g(x, ε) are equal to unique known transformations of y, x, E(xx') and E(xy). This means that g(x, ε), ε and ∂ₓ g(x, ε) can be estimated using estimators for E(xx') and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares (OLS) estimator and its various generalizations.
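As a rough illustration of the mapping β = [E(xx')]⁻¹E(xy) — not part of the course do-files, with invented variable names, parameter values and seed — the following Stata sketch simulates a linear population and checks that the sample analogue of the moment formula coincides with OLS:

* Sketch: sample analogue of beta = [E(xx')]^(-1) E(xy) versus regress
clear
set seed 12345
set obs 10000
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 0.5*x1 - 2*x2 + rnormal()     // true beta = (0.5, -2, 1)'
matrix accum XX = x1 x2                         // X'X (a constant is appended last)
matrix vecaccum Xy = y x1 x2                    // y'X (a row vector, constant last)
matrix b_mom = invsym(XX)*Xy'                   // (X'X)^(-1) X'y
matrix list b_mom
regress y x1 x2                                 // reproduces b_mom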
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality:
P.4: Var(ε|x) = σ².
P.5: ε|x ~ N(0, σ²).
A more general variant of the linear model, instead, replaces assumption P.3 with
P.3b: E(xε) = 0.
Under P.3b it is still true that β = [E(xx')]⁻¹E(xy) and ∂ₓ g(x, ε) = β, with the virtue that the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general.
The function x'β, with β = [E(xx')]⁻¹E(xy), is relevant in either version of the linear model and is called the linear projection of y onto x.
Exercise 2. Prove that if x contains 1, then E(xε) = 0 is equivalent to E(ε) = 0 and cov(ε, x) = 0 (hint: remember that cov(ε, x) = E(xε) − E(x)E(ε)).
Solution: Assume E(xε) = 0. Since 1 is an element of x, the first component of E(xε) is E(ε) = 0; then, given cov(ε, x) = E(xε) − E(x)E(ε), cov(ε, x) = 0. Conversely, assume E(ε) = 0 and cov(ε, x) = 0. Then E(xε) = E(x)E(ε) = 0.
CHAPTER 2
The linear regression model
The linear regression model is a statistical model: as such, it incorporates a probabilistic model of the population and a sampling mechanism that draws the data from the population.
2.1. From the linear population model to the linear regression model
Consider the linear model of the previous chapter: the population equation (1.1.1)
y = g(x, ε)
with the assumptions
P.1: Linearity: g(x, ε) = x'β + ε, with x = (x1 x2 ... xk)' being a k × 1 random vector, ε a random scalar and β = (β1 β2 ... βk)' a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k.
P.3: Conditional-mean independence of ε and x: E(ε|x) = 0.
Now, add the random sampling assumption:
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently and identically distributed (i.i.d.) random vectors.
Given P.1-P.3 and RS, we have the linear regression model (LRM)
(2.1.1) yi = xi'β + εi
with xi' = (xi1 xi2 ... xik), i = 1, ..., n, and {εi = yi − xi'β, i = 1, ..., n} a sequence of unobserved i.i.d. error terms.
2.2. The properties of the LRM
It is convenient to express the LRM in compact matrix form as
(2.2.1) y = Xβ + ε
where y is the n × 1 vector stacking the yi, X is the n × k matrix whose i-th row is xi', and ε is the n × 1 vector stacking the εi:
y = (y1, ..., yi, ..., yn)',   X = (x1, ..., xi, ..., xn)',   ε = (ε1, ..., εi, ..., εn)'.
It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties:
LRM.1: Linearity in the parameters.
LRM.2: X has full column rank, that is, rank(X) = k.
LRM.3: The variables in X are strictly exogenous, that is,
E(εi | x1, ..., xi, ..., xn) = 0,
i = 1, ..., n, or, more compactly, E(ε|X) = 0.
LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of the other columns of X or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi'a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi) = 0, i = 1, ..., n, or E(yi|xi) − xi'β = 0, i = 1, ..., n. Since
E(εi | x1, ..., xi, ..., xn) = E(yi | x1, ..., xi, ..., xn) − xi'β
and, by RS, E(yi | x1, ..., xi, ..., xn) = E(yi|xi), then
E(εi | x1, ..., xi, ..., xn) = E(yi|xi) − xi'β = 0.
If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one easily verifies that
LRM.4: Var(ε|X) = σ²I_n.
LRM.5: ε|X ~ N(0, σ²I_n).
While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by
LRM.3b: E(Σᵢ₌₁ⁿ xiεi) = 0 or, more compactly, E(X'ε) = 0.
2.3. Difficulties
Some or all of LRM.1-LRM.5 may not be verified if the population model assumptions
and/or the RS mechanism are not verified in reality. Here is a list of the most important
population issues.
• Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.
• Perfect multicollinearity (P.2 fails): some variables in x are exact linear combinations of the others. LRM.2 fails, but in general this is not a serious problem: it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.
• Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.
• Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
• Non-normality (P.5 fails): ε is not conditionally normal. LRM.5 fails.
Other important problems are instead with the RS assumption.
• Omitted variables: some of the variables in x are not sampled. This implies that the missing variables cannot enter the conditioning set and have to be treated as unobserved errors, along with ε, which could make LRM.3-LRM.5 fail.
• Measurement error: some of the variables in x are measured with error. We have the wrong variables in the conditioning set. As in the case of omitted variables, LRM.3-LRM.5 may fail.
• Endogenous selection: some units in the sample are missing due to events related to ε. Also in this case, LRM.3-LRM.5 are likely to fail.
Notice that problems in the RS mechanism often have their roots in the population model. For example, the presence of non-random variables in x is not in general compatible with an identically distributed sample and, in consequence, with RS. It is easy to verify, though, that non-random x together with a weaker sampling mechanism requiring only independent sampling is compatible with LRM.1-LRM.5. Also, the presence of variables in x at different levels of aggregation may not be compatible with independent sampling, as observed by Moulton (1990). In this case, the sampling mechanism can be relaxed by maintaining independence only across groups of observations and not across the observations themselves. See for example the sampling mechanism described in Section 8.6 for panel data models, in which the sample is neither identically distributed nor independent across observations.
Finally, it is important to emphasize that even if all the population assumptions and the
RS mechanism are valid, data problems may arise in the form of multicollinearity among
regressors.
• Multicollinearity: some of the variables in X are almost collinear. In the population this is reflected by det[E(xx')] ≃ 0 and in the sample by det(X'X) ≃ 0.
As we will see in Chapter 4, although multicollinearity does not affect the statistical properties
of the estimators in finite samples, it can severely affect the precision of the coefficient estimates
in terms of large standard errors.
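The following Stata sketch — not from the course material; data, variable names and parameter values are invented for illustration — shows the loss of precision under near-collinearity:

* Sketch: near-collinearity inflates standard errors
clear
set seed 101
set obs 200
generate x1 = rnormal()
generate x2 = x1 + 0.01*rnormal()       // x2 is almost collinear with x1
generate x3 = rnormal()                 // an unrelated comparison regressor
generate y  = 1 + x1 + x3 + rnormal()
regress y x1 x3                         // precise estimates
regress y x1 x2 x3                      // the standard errors on x1 and x2 explode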
CHAPTER 3
The Algebraic properties of OLS
3.1. Motivation, notation, conventions and main assumptions
We do not agree with Larry (the adult croc), do we? Algebra may be boring, but only if its purpose is left obscure. Algebra in econometrics provides the bricks with which estimators and tests are constructed. The fact that most estimators and tests are automatically implemented by statistical packages is no excuse to neglect the underlying algebra. First, because most does not mean all, and it may be the case that for our own research we have to build the technique ourselves. This is especially true for the most recent techniques: a robust Hausman test for panel data models and multiway cluster-robust standard errors are just two examples of techniques that are not yet coded in the popular statistical packages. Second, even if the technique is available as a built-in procedure in our preferred statistical package, to use it correctly we have to know how it is made, which boils down to understanding its algebra. Finally, the interpretation of results often requires that we are aware of the algebraic properties of estimators and tests. So the material here may seem rather intricate at times, but it is certainly of practical use.
This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real n × 1 vector containing the observations of the dependent variable; X is a real (n × k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data
set US_gasoline.dta.
3.2. Linear combinations of vectors
Given the real (n × k) matrix A, the columns of A are said to be linearly dependent if there exists some non-zero (k × 1) vector b such that Ab = 0.
Given the real (n × k) matrix A, the columns of A are said to be linearly independent if Ab = 0 only if b = 0.
Two real non-zero (n × 1) vectors a and b are said to be orthogonal if a'b = 0. Given two real non-zero (n × k) matrices A and B, if each column of A is orthogonal to all columns of B, so that A'B = 0, then A and B are said to be orthogonal.
3.3. OLS: definition and properties
We do not have any model in mind here, just data for the response variable, the n × 1 vector y = (y1, ..., yi, ..., yn)', and the n × k regressor matrix X whose i-th row is xi' = (xi1 xi2 ... xik). I only maintain that rank(X) = k.
We aim at finding an optimal approximation of y using the information contained in y and X. One such approximation can be obtained through the ordinary least squares (OLS) estimator, b, defined as the minimizer of the residual sum of squares S(b₀):
b = argmin_{b₀} S(b₀),
where
S(b₀) = (y − Xb₀)'(y − Xb₀).
Geometrically, Xb is an optimal approximation of y in that it minimizes the Euclidean distance from the vector y to the hyperplane Xb₀. As such, b satisfies
∂S(b₀)/∂b₀ |_{b₀ = b} = 0.
Expanding (y − Xb₀)'(y − Xb₀):
S(b₀) = y'y − b₀'X'y − y'Xb₀ + b₀'X'Xb₀ = y'y − 2y'Xb₀ + b₀'X'Xb₀.
Then, taking the partial derivatives,
∂S(b₀)/∂b₀ = −2X'y + 2X'Xb₀,
so that the first order conditions (OLS normal equations) of the minimization problem are
(3.3.1) −X'y + X'Xb = 0,
with the resulting formula for the OLS estimator
(3.3.2) b = (X'X)⁻¹X'y.
Notice that:
• The estimator exists since X'X is non-singular, X being of full column rank.
• The estimator is a true minimizer since the Hessian of S(b₀),
∂²S(b₀)/∂b₀∂b₀' = 2X'X,
is a positive definite matrix (i.e. S(b₀) is globally convex in b₀). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c'Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption, z = Xc ≠ 0 for any c ≠ 0, therefore c'X'Xc = z'z = Σᵢ₌₁ⁿ zᵢ² > 0 for any c ≠ 0.
The OLS residuals are defined as
(3.3.3) e = y − Xb.
By (3.3.1) it follows that e and X are orthogonal:
(3.3.4) X'(y − Xb) = 0.
Therefore, if X contains a column of all unity elements, say 1, three important implications follow.
(1) The sample mean of e is zero: 1'e = Σᵢ₌₁ⁿ eᵢ = 0 and, consequently, ē = (1/n) Σᵢ₌₁ⁿ eᵢ = 0.
(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is, ȳ = x̄'b, where ȳ = (Σᵢ₌₁ⁿ yᵢ)/n and x̄' = ((1/n) Σᵢ₌₁ⁿ xᵢ₁, ..., (1/n) Σᵢ₌₁ⁿ xᵢₖ) (this follows straightforwardly from 1'e = 1'(y − Xb) = 0).
(3) Let
(3.3.5) ŷ = Xb
denote the OLS predicted values of y. Since the sample mean of Xb equals x̄'b, the sample mean of ŷ equals ȳ.
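As a quick sketch — not part of the course do-files; depvar, x1 and x2 are placeholder variable names — implications (1)-(3) can be checked numerically on any fitted regression with a constant:

* Sketch: numerical check of the algebraic implications of the normal equations
quietly regress depvar x1 x2
generate double yhat = _b[_cons] + _b[x1]*x1 + _b[x2]*x2   // Xb computed by hand
generate double ehat = depvar - yhat                       // OLS residuals
summarize ehat                        // (1) the sample mean of e is zero
summarize depvar yhat                 // (3) fitted values have the same mean as y
quietly summarize x1
scalar x1bar = r(mean)
quietly summarize x2
scalar x2bar = r(mean)
quietly summarize depvar
display "ybar = " r(mean) "   xbar'b = " _b[_cons] + _b[x1]*x1bar + _b[x2]*x2bar   // (2)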
3.3.1. Stata implementation: get your Stata data file with use. All Stata data
files can be recognized by their filetype dta. Suppose you have y and X within a Stata data
file called, say, mydata.dta, stored in your Stata working directory and that you have just
launched Stata on your laptop. To get your data into memory, from the Stata command line
execute use followed by the name of the data file (specifying the filetype dta is not necessary
since use only supports dta files):
use mydata
If mydata.dta is not in your Stata working directory but somewhere else on your laptop, then you must specify the path of the dta file. For example, if you have a Mac and your data file is in the folder /Users/giovanni you will write
use /Users/giovanni/mydata
If you have a PC and your file is in c:\giovanni:
use c:\giovanni\mydata
If the path involves folders with names that include blanks, then include the whole path
into double quotes. For example:
use "/Users/giovanni/didattica/Greene/dati/ch. 1/mydata"
3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,
help use
will open a new window describing use:
[Screenshot of the Stata help window for use; its recoverable content, in readable order, is the following.]
Title:       [D] use — Load Stata-format dataset
Syntax:      use filename [, clear nolabel]
             use [varlist] [if] [in] using filename [, clear nolabel]
Menu:        File > Open...
Description: use loads a Stata-format dataset previously saved by save into memory. If filename is specified without an extension, .dta is assumed. If your filename contains embedded spaces, remember to enclose it in double quotes. In the second syntax for use, a subset of the data may be read.
Options:     clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk. nolabel prevents value labels in the saved data from being loaded.
Examples:    . use http://www.stata-press.com/data/r11/auto
             . use make rep78 foreign using myauto
             . use if foreign == 0 using myauto
3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in example 1.2 of Greene (2008) with annual
values of US aggregate consumption (c) used as the dependent variable and regressed on
annual values of US personal income (y) for the period 1970-1979.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"

. regress c y

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

------------------------------------------------------------------------------
           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000     .9063809    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
------------------------------------------------------------------------------
regress includes the constant term (the unity vector) by default and always with the
name _cons. If you don’t want it, just add the regress option, noconstant:
regress depvar x1 x2, noconstant
Notice that, as a general rule of Stata syntax, the options of any Stata command always follow the comma. This means that if you wish to specify options you have to write the comma after the last argument of the command, so that everything to the right of the comma is treated by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma at all.
After execution, regress leaves behind a number of objects in memory, mainly scalars
and matrices, that will stay there, available for use, until execution of the next estimation
command. To know what these objects are, consult the section Saved results in the help
of regress, where you will find the following description.
Saved results
regress saves the following in e():
Scalars:   e(N) number of observations; e(mss) model sum of squares; e(df_m) model degrees of freedom; e(rss) residual sum of squares; e(df_r) residual degrees of freedom; e(r2) R-squared; e(r2_a) adjusted R-squared; e(F) F statistic; e(rmse) root mean squared error; e(ll) log likelihood under the additional assumption of i.i.d. normal errors; e(ll_0) log likelihood, constant-only model; e(N_clust) number of clusters; e(rank) rank of e(V).
Macros:    e(cmd) regress; e(cmdline) command as typed; e(depvar) name of dependent variable; e(model) ols or iv; e(wtype) weight type; e(wexp) weight expression; e(title) title in estimation output when vce() is not ols; e(clustvar) name of cluster variable; e(vce) vcetype specified in vce(); e(vcetype) title used to label Std. Err.; e(properties) b V; e(estat_cmd) program used to implement estat; e(predict) program used to implement predict; e(marginsok) predictions allowed by margins; e(asbalanced), e(asobserved) factor-variable settings.
Matrices:  e(b) coefficient vector; e(V) variance-covariance matrix of the estimators; e(V_modelbased) model-based variance.
Functions: e(sample) marks estimation sample.
You should already be familiar with some of the e() objects in the Scalars and Matrices parts. By the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts: they are rather technical and, in any case, not relevant for our purposes.
To know the values taken on by the e() objects, execute the command ereturn list just
after the regress instruction. In our regression example we have:
. ereturn list

scalars:
               e(N) =  10
            e(df_m) =  1
            e(df_r) =  8
               e(F) =  959.919036180133
              e(r2) =  .9917348458900325
            e(rmse) =  8.193020017500434
             e(mss) =  64435.11918375102
             e(rss) =  537.0046160573024
            e(r2_a) =  .9907017016262866
              e(ll) =  -34.10649331948547
            e(ll_0) =  -58.08502782843004
            e(rank) =  2

macros:
         e(cmdline) : "regress c y"
           e(title) : "Linear regression"
       e(marginsok) : "XB default"
             e(vce) : "ols"
          e(depvar) : "c"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"

matrices:
               e(b) :  1 x 2
               e(V) :  2 x 2

functions:
           e(sample)
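A small sketch of how the saved results can be reused right after regress c y (the lines below only use e() objects and the _b/_se system variables that regress leaves in memory; they are illustrative, not part of the notes' do-files):

* Sketch: recycling regress saved results
matrix list e(b)                  // the OLS coefficient vector
matrix list e(V)                  // its estimated covariance matrix
display e(r2)                     // the R-squared shown in the output header
display e(rss)                    // the residual sum of squares (Residual SS)
display _b[y]/_se[y]              // reproduces the t statistic reported for y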
3.4. Spanning sets and orthogonal projections
Consider the n-dimensional Euclidean space Rⁿ and the (n × k) real matrix A. Then each column of A belongs to Rⁿ, and the set of all linear combinations of the columns of A is called the space spanned by the columns of A (or also the range of A), denoted by R(A).
R(A) can easily be proved to be a subspace of Rⁿ (it is obvious that R(A) ⊆ Rⁿ; R(A) is a vector space since, given any two vectors a1 and a2 belonging to R(A), a1 + a2 ∈ R(A) and ca1 ∈ R(A) for any real scalar c). Since each element of R(A) is a vector of n components, R(A) is said to be a vector space of order n. The dimension of R(A), denoted by dim[R(A)], is the maximum number of linearly independent vectors in R(A). Therefore, dim[R(A)] = rank(A) and, if A is of full column rank, dim[R(A)] = k.
The set of all vectors in Rⁿ that are orthogonal to the vectors of R(A) is denoted by A⊥. I now prove that A⊥ is a subspace of Rⁿ. A⊥ ⊆ Rⁿ by definition. Given any two vectors b1 and b2 belonging to A⊥ and any a ∈ R(A), b1'a = 0 and b2'a = 0, but then also (b1 + b2)'a = 0 and, for any scalar c, (cb1)'a = 0, which completes the proof.
Importantly, it is possible to prove, although this is not pursued here, that
(3.4.1) dim(A⊥) = n − rank(A).
A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).
Exercise 3. Prove that any subspace of Rn contains the null vector.
For simplicity, assume A of full column rank and define the operator P[A] as
P[A] = A(A'A)⁻¹A'.
As an exercise you can verify that P[A] is a symmetric (P[A]' = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is called an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,
M[A] = I − P[A]
is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is
symmetric and idempotent).
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n × k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k × k) matrix).
Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k × k). Therefore, P[A] = BK(K'B'BK)⁻¹K'B'. An important result of linear algebra states that, given two conformable matrices C and D, rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64)),
P[A] = BK(K'B'BK)⁻¹K'B' = BKK⁻¹(B'B)⁻¹(K')⁻¹K'B' = P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is P[A]M[A] = 0.
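As a numerical check (not a proof), the projector properties can be verified in Stata's matrix language on a small example; the 4 × 2 matrix A below is an arbitrary full-column-rank matrix chosen for illustration:

* Sketch: P[A] symmetric and idempotent, and P[A]M[A] = 0, on a toy matrix
matrix A = (1, 0 \ 1, 1 \ 1, 2 \ 1, 5)
matrix P = A*invsym(A'*A)*A'            // P[A]
matrix M = I(4) - P                     // M[A]
matrix list P                           // symmetric
matrix PP = P*P - P
matrix list PP                          // zero (up to rounding): P is idempotent
matrix PM = P*M
matrix list PM                          // zero (up to rounding): P and M are orthogonal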
3.5. OLS residuals and fitted values
The foregoing results are useful to properly understand the properties of OLS. But before going on, do the following exercise.
Exercise 7. Given any (n × 1) real vector v lying in R(A), prove that P[A]v = v and M[A]v = 0 (hint: express v as v = Ac, where c is a real (k × 1) vector).
From exercise 7 it clearly follows that
(3.5.1) P[A]A = A
and
(3.5.2) M[A]A = 0.
Using the OLS formula in (3.3.2),
(3.5.3) e = M[X]y,
where
M[X] = I − X(X'X)⁻¹X'.
Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the "residual maker". From (3.3.2) and (3.3.5) it follows that
ŷ = P[X]y
and so the vector of OLS predicted (fitted) values, ŷ, is the orthogonal projection of y onto the space spanned by the regressors, R(X). Clearly, e'ŷ = 0 (see exercise 6 or also equation (3.3.4)), therefore the OLS method decomposes y into two orthogonal components:
(3.5.4) y = ŷ + e.
3.5.1. Stata implementation: an important post-estimation command, predict. In applications ŷ and e are useful for a number of purposes. They can be obtained as new variables in your Stata data set using the post-estimation command predict. The way predict works is simple. Imagine you have just executed your regress instruction. To have ŷ in your data as a variable called, say, y_hat, just write from the command line:
predict y_hat
You have thereby created a new variable with name y_hat that contains the ŷ values. Fitted values are the default calculation of predict; if you want residuals, just add the res option:
predict resid, res
and you have got a new variable in your data called resid that contains the e values.
It is important to stress that predict supports any estimation command, not only regress. So it can be used, for example, after xtreg in the context of panel data.
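A short sketch — placeholder variable names, not the course data — of how predict recovers the orthogonal decomposition y = ŷ + e of Section 3.5:

* Sketch: predict and the decomposition y = yhat + e
quietly regress depvar x1 x2
predict double yhat2                    // fitted values (the default statistic)
predict double ehat2, residuals         // residuals
generate double check = depvar - yhat2 - ehat2
summarize check                         // identically zero: y = yhat + e
correlate yhat2 ehat2, covariance       // sample covariance ~ 0: e is orthogonal to yhat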
3.6. Partitioned regression
Partition X = (X1 X2) and, accordingly, b = (b1', b2')'.
Exercise 8. Prove that if X is of full column rank, so are X1 and X2.
Exercise 9. Prove that if X is of full column rank, then M[X1]X2 is of f.c.r. Solution: I prove the result by contradiction and assume that P[X1]X2b = X2b for some vector b ≠ 0. Therefore X1a = X2b, where a = (X1'X1)⁻¹X1'X2b, or Xc = 0, where c' = (a' −b'), which leads to a contradiction since c ≠ 0 and X is of f.c.r.
Reformulate the normal equations (3.3.1) according to the partitioning:
( X1' )       ( X1'X1   X1'X2 ) ( b1 )
(     ) y  −  (               ) (    )  =  0
( X2' )       ( X2'X1   X2'X2 ) ( b2 )
or
(3.6.1) X1'y − X1'X1b1 − X1'X2b2 = 0
(3.6.2) X2'y − X2'X1b1 − X2'X2b2 = 0.
Solving the first set of equations for b1,
(3.6.3) b1 = (X1'X1)⁻¹X1'(y − X2b2),
which shows at once the important result that if X1 and X2 are orthogonal, then
b1 = (X1'X1)⁻¹X1'y
and
b2 = (X2'X2)⁻¹X2'y,
that is, b1 (b2) can be obtained from the reduced OLS regression of y on X1 (X2).
In general situations things are not so simple, but it is still possible to work out both components of b in closed form. Replace the right hand side of equation (3.6.3) into the second set of equations (3.6.2) to obtain
X2'y − X2'X1(X1'X1)⁻¹X1'(y − X2b2) − X2'X2b2 = 0
or equivalently, using the orthogonal projector notation P[X1] for X1(X1'X1)⁻¹X1',
X2'y − X2'P[X1](y − X2b2) − X2'X2b2 = 0.
Rearrange the foregoing equation as
X2'(I − P[X1])y + X2'(P[X1] − I)X2b2 = 0
so that eventually
b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y.
By symmetry,
b1 = (X1'M[X2]X1)⁻¹X1'M[X2]y.
By inspecting either formula it emerges that b2 (b1, respectively) can be obtained from a regression where the dependent variable is the residual vector obtained by regressing y on X1 (X2) and the regressors are the residuals obtained from the regressions of each column of X2 (X1) on X1 (X2). This important result is known in the econometric literature as the Frisch-Waugh-Lovell Theorem.
Since b exists, so do its components, which proves that X1'M[X2]X1 and X2'M[X1]X2 are non-singular when X is of full column rank. This result can also be verified by direct inspection, as suggested by the following exercise.
Exercise 10. Prove that X2'M[X1]X2 is positive definite if X is of f.c.r. (hint: use exercise 9 to prove that M[X1]X2 is of full column rank and then the fact that M[X1] is symmetric and idempotent). Solution: By exercise 9, M[X1]X2a ≠ 0 if a ≠ 0. Let z = M[X1]X2a; hence a'X2'M[X1]X2a = z'z is a sum of squares with at least one positive element. Therefore, a'X2'M[X1]X2a > 0.
Exercise 11. Partitioning X = (X1 1), where 1 is the (n × 1) vector of all unity elements, prove that M[1] = I − 1(1'1)⁻¹1' transforms all variables into deviations from their sample means, and hence that the OLS estimator b1 can be obtained by regressing y demeaned on X1 demeaned.
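A quick Stata sketch of the Frisch-Waugh-Lovell result just stated (placeholder variable names; here the constant and x1 together play the role of X1, and x2 plays the role of X2):

* Sketch: FWL — the coefficient on x2 survives partialling out x1
quietly regress depvar x1 x2
display _b[x2]                          // coefficient from the full regression
quietly regress depvar x1
predict double ytilde, residuals        // y purged of x1 (and the constant)
quietly regress x2 x1
predict double x2tilde, residuals       // x2 purged of x1 (and the constant)
regress ytilde x2tilde, noconstant      // same coefficient as above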
The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.
Lemma 12. Given the partitioning X = (X1 X2), the following representation of P[X] holds:
(3.6.4) P[X] = P[X1] + P[M[X1]X2],
with the two terms on the right hand side of (3.6.4) being orthogonal.¹
¹ Equation (3.6.4) can be proved directly, using the formula for the inverse of a 2 × 2 partitioned matrix, or indirectly, but more easily, by using the FWL Theorem and noticing that, for any non-null y and X = (X1 X2) of f.c.r., P[X]y = X1b1 + X2b2, where b1 = (X1'X1)⁻¹X1'(y − X2b2) and b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y. You just have to plug the right hand sides of b1 and b2 into the right hand side of P[X]y = X1b1 + X2b2, work through all the algebraic simplifications and eventually notice that the equation you end up with, P[X]y = (P[X1] + P[M[X1]X2])y, must hold for any non-null y, so that P[X] = P[X1] + P[M[X1]X2].
3.6.1. Add one regressor. Given the initial regressor matrix, X, include an additional regressor z (z is a non-zero (n × 1) vector), so that there is a larger regressor matrix, W, partitioned as W = (X z).
Now, consider the OLS estimator from the regression of y on W,
(d', c)' = (W'W)⁻¹W'y,
with the resulting OLS residual
(3.6.5) u = y − W(d', c)' = y − Xd − zc
and the formula for c in closed form, obtained as a specific application of the general analysis of the previous section,
(3.6.6) c = (z'M[X]z)⁻¹z'M[X]y = z'M[X]y / (z'M[X]z),
where M[X] = I − X(X'X)⁻¹X'.
Exercise 13. Derive the formula for d in closed form solution.
We wish to prove that the residual sum of squares always decreases when one regressor is added to X, i.e., given e defined as in (3.3.3), u'u < e'e. To this purpose, it is convenient to express d as in equation (3.6.3):
d = (X'X)⁻¹X'(y − zc).
Replacing the foregoing equation into (3.6.5) yields
u = y − X(X'X)⁻¹X'(y − zc) − zc = y − X(X'X)⁻¹X'y + X(X'X)⁻¹X'zc − zc = M[X]y − M[X]zc.
Since M[X]y = e (from (3.5.3); remember that M[X] is the "residual maker"),
u = e − M[X]zc,
so that the residual sum of squares of the enlarged regression is
(3.6.7) u'u = e'e − e'M[X]zc − cz'M[X]e + c²z'M[X]z = e'e − 2cz'M[X]e + c²z'M[X]z.
Then, replacing e with M[X]y in the second term of equation (3.6.7) and using the fact that M[X] is idempotent gives
u'u = e'e − 2cz'M[X]y + c²z'M[X]z.
Finally, replace c from (3.6.6) into the foregoing equation to obtain
(3.6.8) u'u = e'e − 2(z'M[X]y)²/(z'M[X]z) + (z'M[X]y)²/(z'M[X]z) = e'e − (z'M[X]y)²/(z'M[X]z)
and since (z'M[X]y)²/(z'M[X]z) > 0,
(3.6.9) u'u < e'e.
Exercise 14. How would the formulas for c, d and u'u change if the new regressor z is orthogonal to X?
3.6.2. The squared coefficient of partial correlation (not covered in class, but a good and easy exercise). The squared coefficient of partial correlation between y and z, r*²_yz,
r*²_yz = (z'M[X]y)² / (z'M[X]z · y'M[X]y),
measures the extent to which y and z are related net of the variation in X. In this sense it is closely related to c, and indeed
c = r*²_yz · y'M[X]y / (z'M[X]y).
Moreover, by (3.5.3), y'M[X]y = e'e, and hence, given equation (3.6.8), we have
u'u = e'e (1 − r*²_yz).
3.7. Goodness of fit and the analysis of variance
Assume that the unity vector, 1, is part of the regressor matrix X. Total variation in y can be expressed by the following sum of squares, referred to as the sum of squared total deviations:
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
or, given what was established in exercise 11,
SST = y'M[1]y.
Notice that SST is the sample variance of y, SST/(n − 1), times the appropriate degrees-of-freedom correction, n − 1. Incidentally, the degrees-of-freedom correction in the sample variance is n − 1 and not n because M[1]y are the residuals from the regression of y on 1 (see exercise 11), and so there can be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then, given equation (3.4.1), dim(1⊥) = n − 1.
By the orthogonal decomposition of y in (3.5.4),
M[1]y = M[1]ŷ + M[1]e.
But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, so that
(3.7.1) M[1]e = e
and
M[1]y = M[1]ŷ + e.
Then,
SST = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.
But since equation (3.7.1) holds, ŷ = Xb, and e and X are orthogonal, then e'M[1]ŷ = e'ŷ = e'Xb = 0, and so SST simplifies to
SST = ŷ'M[1]ŷ + e'e.
Throughout, I stick to Greene's acronyms and refer to ŷ'M[1]ŷ as SSR (regression sum of squares) and to e'e as SSE (error sum of squares). Notice, however, that the name "error sum of squares" may be misleading in contexts where the distinction between random errors and estimated residuals is crucial, given that e'e actually represents the part of total variation accounted for by the OLS residuals, not by the errors. For this reason, I continue to refer to e'e as the residual sum of squares and sometimes abbreviate it with Greene's acronym SSE.
As with SST, SSE is the sample variance of the residuals times the appropriate degrees-of-freedom correction, n − k. Again, the degrees-of-freedom correction in the sample variance is n − k and not n because in the residual space, X⊥, there can be no more than n − k linearly independent vectors. This follows from the assumption that X is of full column rank, so that rank(X) = k and, given equation (3.4.1), dim(X⊥) = n − k.
The coefficient of determination, R², is defined as
(3.7.2) R² = SSR/SST = ŷ'M[1]ŷ / (y'M[1]y)
and since ŷ'M[1]ŷ = y'M[1]y − e'e,
R² = 1 − e'e / (y'M[1]y).
Therefore, if the constant term is included in the regression, 0 ≤ R² ≤ 1 and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit.² There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y already lies in R(X), then y = ŷ (and also e'e = 0) and R² = 1, a perfect (but useless) fit.³
² If the constant term is not included in the regression, then (3.7.1) does not hold and R² may be negative.
³ I am maintaining throughout the obvious assumption that in any case y ∉ R(1). Why?
A problem with the R² measure is that it never decreases when a regressor is added to X (this emerges straightforwardly from the R² formula and the inequality in (3.6.9)), so in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case R² = 1 is attained if n = k, since in that case y ends up lying in R(X)). This problem may be obviated by using the corrected (or adjusted) R², R̄², defined by including in the formula for R² the appropriate degrees-of-freedom corrections:
R̄² = 1 − [SSE/(n − k)] / [SST/(n − 1)].
In fact, R̄² does not necessarily increase when one more regressor is added.
Exercise 15. Prove that, given W, u and e defined as in Section 3.6.1, the coefficient of determination resulting from the regression of y on W is
R²_W = R² + (1 − R²) r*²_yz,
where R² is the goodness-of-fit measure from the reduced regression.
Exercise 16. Prove that
R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).
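A short Stata sketch of the two facts just discussed — adding a regressor never raises the residual sum of squares nor lowers R², while R̄² can fall. Variable names are placeholders and the added regressor is pure noise by construction:

* Sketch: R-squared versus adjusted R-squared when an irrelevant regressor is added
quietly regress depvar x1 x2
display "R2 = " e(r2) "   adj R2 = " e(r2_a) "   SSE = " e(rss)
generate double junk = rnormal()
quietly regress depvar x1 x2 junk
display "R2 = " e(r2) "   adj R2 = " e(r2_a) "   SSE = " e(rss)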
3.8. Centered and uncentered goodness-of-fit measures
Consider the OLS regression of y on the sample regressor matrix X and let b denote the OLS vector. The centered and uncentered R-squared measures (see Hayashi (2000), p. 20) for this regression are defined as
(3.8.1) R² ≡ ŷ'M[1]ŷ / (y'M[1]y) = b'X'M[1]Xb / (y'M[1]y) = y'P[X]M[1]P[X]y / (y'M[1]y)
and
(3.8.2) R²_u ≡ ŷ'ŷ / (y'y) = b'X'Xb / (y'y) = y'P[X]y / (y'y),
respectively. It is easy to prove that 0 ≤ R²_u ≤ 1 and
R²_u = 1 − e'e / (y'y)
whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and
(3.8.3) R² = 1 − e'e / (y'M[1]y)
hold if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample means, that is, M[1]y = y and M[1]X = X. Clearly, in the latter case R² = R²_u.
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²_{y,ŷ}, that is,
(3.8.4) R² = (ŷ'M[1]y)² / (ŷ'M[1]ŷ · y'M[1]y),
where R² is defined in (3.8.1) and ŷ = Xb. Given the definition of R² in (3.7.2), proving equation (3.8.4) boils down to proving ŷ'M[1]y = ŷ'M[1]ŷ, which can be accomplished easily along the following lines. Since y = ŷ + e,
ŷ'M[1]y = ŷ'M[1](ŷ + e) = ŷ'M[1]ŷ + ŷ'M[1]e = ŷ'M[1]ŷ + ŷ'e = ŷ'M[1]ŷ,
where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals.
This property is not shared by the uncentered R-squared, unless the variables have zero sample means.
3.8.1. A convenient formula for R² when the constant is included (not covered in class, but a good and easy exercise). Partitioning X as X = (X̃ 1) and using Lemma 12 gives P[X] = P[1] + P[M[1]X̃], which, replaced into (3.8.1), gives in turn
R² = y'(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) y / (y'M[1]y).
But
(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) = P[1]M[1]P[1] + P[M[1]X̃]M[1]P[1] + P[1]M[1]P[M[1]X̃] + P[M[1]X̃]M[1]P[M[1]X̃] = P[M[1]X̃],
so that eventually
(3.8.5) R² = y'P[M[1]X̃]y / (y'M[1]y),
which proves at once that the R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X̃, namely the OLS regression of y in deviations from its sample mean on X̃ in deviations from its sample means.
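A sketch of equation (3.8.5) in Stata (placeholder variable names): the centered R² of the levels regression coincides with the R² that regress reports for the demeaned regression run without a constant, which is the uncentered measure:

* Sketch: centered R2 = uncentered R2 computed on demeaned data
quietly regress depvar x1 x2
scalar r2_centered = e(r2)
foreach v in depvar x1 x2 {
    quietly summarize `v'
    generate double d_`v' = `v' - r(mean)      // deviations from sample means
}
quietly regress d_depvar d_x1 d_x2, noconstant
display "centered R2 (levels regression):   " r2_centered
display "R2 of the demeaned regression:     " e(r2)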
CHAPTER 4
The finite-sample statistical properties of OLS
4.1. Introduction
This chapter is on the finite-sample statistical properties of OLS applied to the LRM.
Finite-sample means that we focus on a fixed sample size n as opposed to n → ∞, a case that
will be covered in Chapter 7. We will learn under what assumptions on the LRM and in which
sense the estimator is optimal. We will also learn how to test linear restrictions on the model
parameters. Finally, we will study an important case of inaccuracy for the OLS, which is the
omitted-variables problem.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using
the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).
4.2. Unbiasedness
Under LRM.1-LRM.3, OLS is unbiased, that is, E(b) = β.
This is proved as follows. From LRM.1, LRM.2 and the OLS formula in (3.3.2),
(4.2.1) b = β + (X'X)⁻¹X'ε.
From LRM.3, then,
E(b|X) = β + (X'X)⁻¹X'E(ε|X) = β.
Finally, using the law of iterated expectations,
E(b) = E_X[E(b|X)] = E_X[β] = β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.
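A small Monte Carlo sketch of unbiasedness; the data generating process, sample size and number of replications are illustrative choices, not taken from the notes:

* Sketch: the OLS slope averages to the true value across repeated samples
capture program drop olssim
program define olssim, rclass
    drop _all
    set obs 50
    generate x = rnormal()
    generate y = 1 + 2*x + rnormal()    // true slope = 2
    regress y x
    return scalar b = _b[x]
end
set seed 202
simulate b = r(b), reps(2000) nodots: olssim
summarize b                             // the mean of b is close to 2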
4.3. The Gauss-Markov Theorem
Let's work out the conditional and unconditional covariance matrices of OLS under LRM.1-LRM.4.
I start with Var(b|X). Since
Var(b|X) = E[(b − β)(b − β)'|X],
then, given equation (4.2.1),
Var(b|X) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹ | X] = (X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹ = σ²(X'X)⁻¹,
where the last equality follows from LRM.3 and LRM.4.
Now I turn to Var(b). Using the law of decomposition of variance,
Var(b) = E_X[Var(b|X)] + Var_X[E(b|X)] = σ²E_X[(X'X)⁻¹],
where the last equality follows from E(b|X) = β.
Next I prove that OLS has the "smallest", in a sense that will be clarified shortly, covariance matrix in the class of linear unbiased estimators.
I define the following partial order on the space of l × l symmetric matrices:
Definition 17. Given any two l × l symmetric matrices A and B, A is said to be "no smaller" than B if and only if A − B is non-negative definite (n.n.d.).
Given the partial order of Definition 17, OLS has the "smallest" conditional covariance matrix in the class of linear unbiased estimators. This important result is universally known as the Gauss-Markov Theorem. To prove it, define the generic member of the foregoing class as
b₀ = Cy,
where C is a generic k × n matrix that depends only on the sample information in X and is such that CX = I_k (how would you explain the last requirement?). OLS is of course a member of the class, with its own C equal to C_OLS = (X'X)⁻¹X'. It is not hard to prove that Var(b₀|X) = σ²CC'. Define now D = C − C_OLS; then DX = 0 and so
Var(b₀|X) = σ²[D + (X'X)⁻¹X'][D' + X(X'X)⁻¹] = σ²(X'X)⁻¹ + σ²DD' = Var(b|X) + σ²DD'.
Since σ²DD' is n.n.d., according to Definition 17, Var(b|X) is "no greater" than the conditional variance of any linear unbiased estimator.
The natural question arises of whether the partial order of Definition 17 is of any relevance in real-world applications. It is, since it readily translates into the total order of the real numbers, which is the domain of the variances of random scalars. Indeed, if A is "no smaller" than B, then r'(A − B)r ≥ 0 for any conformable r. But then, according to the Gauss-Markov Theorem, any linear combination of the components of b, r'b, has a conditional variance no larger than that of r'b₀. Formally, the theorem implies that r'[Var(b₀|X) − Var(b|X)]r ≥ 0. Then, since Var(r'b|X) = r'Var(b|X)r and Var(r'b₀|X) = r'Var(b₀|X)r, it follows that Var(r'b₀|X) ≥ Var(r'b|X). The importance of this hinges upon the fact that in empirical applications we are interested in linear combinations of the population coefficients, as in the following example, where it is shown that the estimates of the individual coefficients can always be expressed as specific linear combinations of the k components of the estimators.
Example 18. On noticing that bᵢ = rᵢ'b and b₀ᵢ = rᵢ'b₀, i = 1, ..., k, where rᵢ is the k × 1 vector with all zero elements except the i-th entry, which equals unity, and given the Gauss-Markov Theorem, we conclude that Var(b₀ᵢ|X) ≥ Var(bᵢ|X), i = 1, ..., k.
In general, the OLS estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
Now we prove that the Gauss-Markov Theorem extends to the unconditional variances. From
Var(b₀|X) = Var(b|X) + σ²DD',
we have
E_X[Var(b₀|X)] = E_X[Var(b|X)] + σ²E_X(DD'),
or
Var(b₀) = Var(b) + σ²E_X(DD'),
and since E_X(DD') is n.n.d., we can also state that the unconditional variance of OLS is "no greater" than that of any linear unbiased estimator.
4.4. Estimating the covariance matrix of OLS
Since σ² is unknown, so are Var(b|X) and Var(b). Unbiased estimators of Var(b|X) and Var(b), therefore, require an unbiased estimator of σ². We now prove that, under LRM.1-LRM.4, the residual sum of squares divided by the appropriate degrees-of-freedom correction, s² = e'e/(n − k), is one such estimator, namely E(s²) = σ².
From e = M[X]y and LRM.1 it follows that e = M[X]ε. Hence,
(4.4.1) E(s²|X) = [1/(n − k)] E(ε'M[X]ε | X).
Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule for the trace of a matrix product, ε'M[X]ε = tr(ε'M[X]ε) = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields
E(s²|X) = [1/(n − k)] E[tr(M[X]εε') | X].
Then, exploiting the fact that both trace and expectation are linear operators,
(4.4.2) E(s²|X) = [1/(n − k)] tr E(M[X]εε'|X) = [1/(n − k)] tr[M[X]E(εε'|X)] = [σ²/(n − k)] tr M[X],
where the last equality follows from LRM.3 and LRM.4. Now, focus on tr M[X]:
tr M[X] = tr[I_n − X(X'X)⁻¹X'] = tr I_n − tr[(X'X)⁻¹X'X] = n − k,
and so (4.4.2) simplifies to E(s²|X) = σ². Finally, by the law of iterated expectations,
E(s²) = σ².
With s² at hand we can obtain an unbiased estimator for Var(b): the estimator s²(X'X)⁻¹. In fact,
E[s²(X'X)⁻¹ | X] = σ²(X'X)⁻¹ = Var(b|X)
and, since E[s²(X'X)⁻¹] = E_X{E[s²(X'X)⁻¹ | X]} by the law of iterated expectations,
E[s²(X'X)⁻¹] = σ²E_X[(X'X)⁻¹] = Var(b).
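A quick Stata sketch — placeholder variable names, not the course data — showing that s² and s²(X'X)⁻¹ are exactly what regress reports:

* Sketch: recovering s^2 and the estimated covariance matrix of b
quietly regress depvar x1 x2
display e(rss)/e(df_r)           // s^2 = e'e/(n - k)
display e(rmse)^2                // the same number: Root MSE squared
matrix V = e(V)                  // s^2 (X'X)^(-1), as computed by regress
matrix list V
display sqrt(V[1,1])             // standard error of the coefficient on x1
display _se[x1]                  // matches the value displayed by regress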
4.5. Exact tests of significance with normally distributed errors
Assume LRM.5: ε|X ~ N(0, σ²I_n). Since equation (4.2.1) holds,
b|X ~ N(β, σ²(X'X)⁻¹).
Also, since e = M[X]ε, e|X ~ N(0, σ²M[X]). Using a result in Rao (1973), it is possible to prove at once that b and e are also jointly normal with zero covariances, conditional on X. Specifically, since

    ( b )   ( β )       ( (X'X)⁻¹X' )  ε
    (   ) = (   ) + σ · (           ) ---,
    ( e )   ( 0 )       (   M[X]    )  σ

then by (8a.2.9) in Rao (1973)

    ( b )          [ ( β )     ( (X'X)⁻¹X' )                     ]
    (   ) | X  ~  N[ (   ), σ² (           ) ( X(X'X)⁻¹  M[X] )  ]
    ( e )          [ ( 0 )     (   M[X]    )                     ]

or

    ( b )          [ ( β )   ( σ²(X'X)⁻¹      0      ) ]
    (   ) | X  ~  N[ (   ),  (                       ) ]
    ( e )          [ ( 0 )   (     0       σ²M[X]    ) ]

Therefore, being normally distributed with zero conditional covariances, conditional on X, b and e are also independent, conditional on X. This general result is important and is therefore stated as a theorem for future reference.
Theorem 19. Assume LRM.5. Then b and e are independent, conditional on X.
Exercise 20. Verify, by direct computation of Cov(b, e|X), that Cov(b, e|X) = 0_{k×n}.
Solution: Since E(e|X) = 0 (verify), then¹
Cov(b, e|X) = E[(b − β)e' | X]
or
Cov(b, e|X) = E[(X'X)⁻¹X'εε'M[X] | X] = (X'X)⁻¹X'E(εε'|X)M[X] = σ²(X'X)⁻¹X'M[X] = 0_{k×n}.
Exercise 21. Verify, by direct computation of Var(e|X), that Var(e|X) = σ²M[X].
Exercise 22. Is Var[(b', e')' | X] non-singular? Why or why not?
¹ In general, the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]' | z}.
4.5.1. Testing single linear restrictions. Let (X 0X)
�1ii stand for the i.th main diagonal
element of (X 0X)
�1, then
bi|X ⇠ N⇣
�i,�2�
X 0X��1ii
⌘
and, given the properties of the normal distribution, bi can be standardized to have
(4.5.1)bi � �i
q
�2(X 0X)
�1ii
|X ⇠ N (0, 1) ,
i = 1, ..., k. Were �2 known, then the above statistics could be used to test hypotheses on �i,
Ho : �i = �⇤i , by replacing the unknown �i with �⇤
i , where �⇤i is a value of interest fixed by
the researcher. For example, to test Ho : �i = 0 one would use
biq
�2(X 0X)
�1ii
⇠ N (0, 1) .
The problem is, of course, that σ² is generally unknown and so the foregoing approach cannot be used as it is. With some adjustment we can make it operational, though. Just replace σ² with s² in the expression for the standardized b_i to get

(4.5.2)   t_i = (b_i − β*_i) / √( s²(X'X)^{-1}_{ii} )

and then prove that t_i has a t distribution with n − k degrees of freedom when β_i = β*_i. The denominator of expression (4.5.2) is the standard error estimate for coefficient b_i.
First, notice that since s² = e'e/(n − k) = ε'M[X]ε/(n − k),

(4.5.3)   (n − k)s²/σ² = (ε/σ)' M[X] (ε/σ).

Now, consider the following distributional result:
• if z ~ N(0, I) and A is a conformable idempotent matrix, then z'Az ~ χ²(p) with p = rank(A).
Since ε/σ ~ N(0, I) and M[X] is idempotent, (n − k)s²/σ² is an idempotent quadratic form in a standard normal vector and, in the light of the foregoing distributional result, has a chi-squared distribution with degrees of freedom equal to rank(M[X]) = n − k:

(n − k)s²/σ² ~ χ²(n − k).

Since Theorem 19 holds, any function of b is independent of any function of e, conditionally on X; hence (b_i − β*_i)/√(σ²(X'X)^{-1}_{ii}) and (n − k)s²/σ² are conditionally independent. Further,

t_i = [ (b_i − β*_i)/√(σ²(X'X)^{-1}_{ii}) ] / √[ ((n − k)s²/σ²) / (n − k) ],

therefore, in the light of the following second distributional result
• if z ~ N(0, 1), x ~ χ²(p) and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom,
we have t_i|X ~ t(n − k), i = 1, ..., k.
Finally, since the t distribution does not depend on the sample information and, specifically, on X, t_i and any component of X are statistically independent, so that the above holds also unconditionally, that is t_i ~ t(n − k), i = 1, ..., k.
Often we wish to test hypotheses involving linear combinations of β, r'β, where r is a k×1 vector of known constants.
Example 23. If we are estimating a two-input Cobb-Douglas production function and β1 and β2 are the output elasticities of the two inputs, the hypothesis of constant returns to scale is clearly important, so that our null is β1 + β2 = 1.
In general we express a null involving a linear combination of the population coefficients as H0: r'β − q = 0, where q is a known constant. In the Cobb-Douglas example r = (1 1)' and q = 1.
The OLS estimator of r'β is r'b, which is normally distributed conditional on X: r'b|X ~ N[ r'β, σ²r'(X'X)^{-1}r ]. Therefore we have the following t test:

(r'b − q) / √( s² r'(X'X)^{-1}r ) ~ t(n − k),

which can be used to test H0: r'β − q = 0.
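As an illustration, here is a minimal Stata sketch of the constant-returns-to-scale test in a log Cobb-Douglas regression; the variable names (lny, lnl, lnk) are hypothetical and are not part of the notes.

. regress lny lnl lnk
. lincom lnl + lnk - 1          // t test of H0: r'beta - q = 0 with r = (1 1)' and q = 1
. test lnl + lnk = 1            // the same single restriction as an F test

lincom reports the estimated linear combination with its standard error, t statistic and confidence interval; for one restriction the F statistic reported by test is the square of that t statistic.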
4.5.1.1. Stata implementation. For the sake of exposition, I report here the regress output already seen in Section 3.3.3.

. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"

. regress c y

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193
------------------------------------------------------------------------------
           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669     .031607   30.98   0.000     .9063809    1.052153
       _cons |  -67.58063    27.91068   -2.42   0.042    -131.9428   -3.218488
------------------------------------------------------------------------------
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). The second column reports the standard error estimate of each OLS coefficient, se_i = √( s²(X'X)^{-1}_{ii} ), i = 1, ..., k. The third column reports the values of the t statistics for the k null hypotheses β_i = 0, i = 1, ..., k:

t_i = b_i / √( s²(X'X)^{-1}_{ii} ).

The test is two-sided in that the alternative is H1: β_i > 0 or β_i < 0. The fourth column reports the so-called p-value of the two-sided t-test. It is defined as the probability that a t-distributed random variable is more extreme than the outcome of t_i in absolute value: Pr[ (t < −|t_i|) ∪ (t > |t_i|) ], or more compactly Pr(|t| > |t_i|). Clearly, if the p-value is smaller
than the chosen size of the test (the Type I error) then ti falls for sure into the critical region
and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of
the critical region (the lowest Type I error) we could have fixed to reject the null, given the test
outcome. In this sense, the p-value is more informative than critical values. In the regress
example, if we choose a critical region of 5% size, given that Pr (|t| > 2.42) = 0.042 < 0.05,
we can reject at 5% that the constant term is equal to zero, knowing that we could also have
rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is
the minimum size allowing rejection, and for this reason we can’t reject at 1%. This is a clear
case of borderline significance, one which we could not have identified with such precision by
simply looking at the 5% critical values. On the other hand, the p-value for the coefficient
on y is virtually zero (reported as 0.000). This indicates that, no matter how conservative we are towards the null, we can reject it at any conventional level of significance
(conventional sizes, with an increasing degree of conservativeness are 10%, 5%, 1%) and also
at a less conventional 0.1% (since 0.001 > 0.000).
4.5.2. From tests to confidence intervals. Let us fix the α·100% critical region for our two-sided t test of the null H0: β_i = β*_i against the alternative H1: β_i ≠ β*_i and let ±t_{α/2} be the corresponding critical values: Pr[ (t < −t_{α/2}) ∪ (t > t_{α/2}) ] = α. Then, the probability of not rejecting the null when it is true is (1 − α). Formally,

Pr{ | (b_i − β*_i)/se_i | < t_{α/2} } = Pr{ −t_{α/2} < (b_i − β*_i)/se_i < t_{α/2} }
                                      = Pr{ −se_i t_{α/2} < b_i − β*_i < se_i t_{α/2} }
                                      = Pr{ b_i − se_i t_{α/2} < β*_i < b_i + se_i t_{α/2} }
                                      = (1 − α).
But (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) is a (1 − α)·100% confidence interval for β_i. This proves that the (1 − α)·100% confidence interval (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) contains all of the null hypotheses β_i = β*_i that we cannot reject at α·100%. So while a given t test is informative only for the specific null it is testing, the confidence interval conveys to the researcher much more information. The last column of the regress output reports the 95% confidence interval for each OLS coefficient.
Exercise 24. Your regression output for a given coefficient β_i yields b_i = −9.320 and se_i = 1.760. 1) Compute the t-statistic for the null H0: β_i = 0. 2) In your regression n − k = 334, which implies that t_{0.025} = 1.967, where t_{0.025}: Pr(t > t_{0.025}) = 0.025. Will you reject or not H0: β_i = 0 against H1: β_i ≠ 0 at a significance level of 5%? Why? 3) Given your answer to Question 2, do you expect that 0 belongs to the 95% confidence interval for β_i? 4) Compute the 95% confidence interval for β_i. On the basis of the information from the confidence interval alone, do you reject H0: β_i = −6 against H1: β_i ≠ −6 at 5%? Why? 5) Using only your answers to Question 4, can you assert that the p-value of that test is greater than 0.05? Also, do you expect the absolute value of the t statistic for H0: β_i = −6 to be greater or smaller than 1.967? Why? Verify your answer by directly computing the value of the t statistic for H0: β_i = −6. 6) Consider now the test of H0: β_i ≤ 0 against H1: β_i > 0 with a 5% significance level. Is the critical value for this test equal to, smaller than or greater than 1.967?
Answer: 1) t_i = −5.295. 2) Reject, because 5.295 > 1.967. 3) No, since H0: β_i = 0 is rejected at 5%. 4) (−12.782, −5.858). No, since −6 ∈ (−12.782, −5.858). 5) Yes: since H0: β_i = −6 is not rejected at 5%, t_i falls within the acceptance region and so the test p-value > 0.05. Since t_i falls within the acceptance region, the value of |t_i| must be smaller than t_{0.025} = 1.967. Indeed, t_i = −1.886. 6) Smaller: since the test is one-sided, the critical value is t_{0.05}.
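The arithmetic in the answer can be reproduced with Stata's display calculator; the numbers are those of the exercise, and the commands are only an illustrative check.

. display (-9.320 - 0)/1.760          // t statistic for H0: beta_i = 0
. display -9.320 - 1.967*1.760        // lower bound of the 95% confidence interval
. display -9.320 + 1.967*1.760        // upper bound of the 95% confidence interval
. display (-9.320 - (-6))/1.760       // t statistic for H0: beta_i = -6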
Exercise 25. Your regression output for a given coefficient β_i yields b_i = 6.668 with se_i = 3.577. The outcome of the t-test of H0: β_i = 0 against H1: β_i ≠ 0 shows p-value = 0.07. Can you reject the null at 10%? Can you at 5%?
4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions: H0: Rβ − q = 0, where R and q are, respectively, a J×k matrix and a J×1 vector of fixed known constants, such that no row of R can be obtained as a linear combination of the others, that is, R has full row rank J.
Under the null,

Rb − q = R(b − β)

and so, given LRM.5,

Rb − q | X ~ N( 0, σ² R(X'X)^{-1}R' ).

Then, using the distributional result that
• given the p×1 random vector x ~ N(μ, Σ), (x − μ)'Σ^{-1}(x − μ) ~ χ²(p),
we have

W = (Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / σ²  | X  ~  χ²(J).
Again, σ² is not known and so W is unfeasible as a test for H0. We can proceed as in the previous section and replace σ² with s². In addition, divide the result by J to get the statistic

F = (Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / (J s²).

Now consider another distributional result:
• given two independent random scalars x1 ~ χ²(p1) and x2 ~ χ²(p2), (x1/p1)/(x2/p2) ~ F(p1, p2).
It is not hard to see that the above result applies to F, since F can be reformulated as the ratio of two conditionally independent chi-squared random variables, each divided by its own degrees of freedom. In fact, in the numerator we have

(Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / (J σ²)

and in the denominator s²/σ². Conditional on X, the former is a function of b alone, while the latter is a function of e alone. Therefore, in the light of Theorem 19 the two are conditionally independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).
As with the t statistic, since the F distribution does not depend on the sample information, the above holds unconditionally: F ~ F(J, n − k).
When H0 is a set of J exclusion restrictions, q = 0 and each row of R has all zero elements except a unity in the entry corresponding to the parameter that is set to zero. For example, with three parameters β' = (β1 β2 β3) and two exclusion restrictions, β1 = 0 and β3 = 0, we have J = 2, q' = (0 0) and

R = ( 1 0 0 ; 0 0 1 )   (rows separated by a semicolon),

so that H0 can be formulated as

( 1 0 0 ; 0 0 1 ) (β1, β2, β3)' = (0, 0)'.
The F-test can always be rewritten as a function of the residual sum of squares from the unrestricted model, e'e, and the residual sum of squares from the model with the restrictions imposed, say e*'e*:

F = [ (e*'e* − e'e)/J ] / [ e'e/(n − k) ].

This is proved for the case of exclusion restrictions by using Lemma 12.
Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set of exclusion restrictions H0: β2 = 0:

F = b2'[ (X2'M[X1]X2)^{-1} ]^{-1} b2 / (k2 s²) = b2'X2'M[X1]X2 b2 / (k2 s²).

Now apply the FWL Theorem to F to get

F = y'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]y / (k2 s²)
  = y'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]y / (k2 s²).

The numerator of the right hand side of the foregoing equation can be written more compactly as y'P[M[X1]X2]y. Hence, by Lemma 12,

F = y'( P[X] − P[X1] )y / (k2 s²)

or, adding and subtracting I_n within the parentheses,

F = y'( M[X1] − M[X] )y / (k2 s²) = [ (e*'e* − e'e)/k2 ] / s².

It is not hard to prove that if the constant term is kept in both models, then

F = [ (R² − R*²)/J ] / [ (1 − R²)/(n − k) ],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted model.
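A hedged Stata sketch of a joint exclusion test and of the R-squared formula above; the variable names (y, x1, x2, x3) are hypothetical and the restricted model drops x1 and x3 (so J = 2 and, counting the constant, k = 4).

. regress y x1 x2 x3
. test x1 x3                      // F test of H0: beta_1 = beta_3 = 0
. scalar r2_u = e(r2)
. scalar nobs = e(N)
. regress y x2                    // restricted model
. scalar r2_r = e(r2)
. display ((r2_u - r2_r)/2) / ((1 - r2_u)/(nobs - 4))   // reproduces the F statistic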
4.6. The general law of iterated expectation
The general form of the law of iterated expectations (LIE) can be stated as in Wooldridge
(2010), pp. 19-20.
LIE(scalar|vector): Given the random variable y and the random vectors w and x,
where x = f (w), then E (y|x) = E [E (y|w) |x].
Since the above result holds for any function f (·), x can just be any subvector of w, as the
following example shows.
Example 26. Consider w = (w1 w2 w3)' and x = Aw, where

A = ( 1 0 0 ; 0 1 0 ),

then x = (w1 w2)' and so, by the general law of iterated expectations,

E(y|w1, w2) = E[ E(y|w1, w2, w3) | w1, w2 ].
The law can, of course, be formulated in terms of conditional expectations of random vectors.
LIE(vector|vector): Given the random vector y and the random vectors w and x, where x = f(w), then E(y|x) = E[ E(y|w) | x ], where

E(y|x) = ( E(y1|x), ..., E(yn|x) )'   and   E(y|w) = ( E(y1|w), ..., E(yn|w) )'.
Remark 27. Notice that in the formulation of conditional expectations the way the conditioning set is represented is just a matter of notational convenience. What matters are the random scalars that enter the conditioning set, not the way they are organized therein. For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w), where w = (w1 w2 w3 w4)', or as E(y|W), where

W = ( w1 w3 ; w2 w4 ),

or through any other organization of (w1, w2, w3, w4).
Given Remark 27 the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.
LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f (W ), then E (y|X) = E [E (y|W ) |X].
Paralleling the consideration made above, since f (·) is a generic function, from LIE(vector|matrix)
follows a special LIE for the case in which X is a submatrix of W . Therefore, given W =
(W1 W2), by LIE(vector|matrix) we always have
E (y|W1) = E [E (y|W ) |W1]
and
E (y|W2) = E [E (y|W ) |W2] .
4.7. The omitted variable bias
If explanatory variables that are relevant in the population model, for some reasons, are not
included into the statistical model - they may be intrinsically latent, such as individual skills,
or the specific data-set in use do not report them, or also, although observed and available,
the researcher failed to account for them in the model specification - then our OLS estimator
may undergone what is known in the econometric literature as an omitted variable bias. Let’s
see when and why.
Assume that the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied, and consider the RS mechanism
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(y_i x_{i1} x_{i2} ... x_{ik}), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.
So far we are in the classical regression framework, but now let x' = (x1' x2'), with x1 a k1×1 vector, x2 a k2×1 vector and k = k1 + k2, maintain that x2 is latent or, in any case, not included in the statistical model, and let's explore the implications for the statistical model. P.1 implies that

(4.7.1)   y = X1β1 + ζ,

where ζ = X2β2 + ε is now the new n×1 vector of latent realizations. If X is of f.c.r., then X1 is of f.c.r. as well. So, in a sense, LRM.1 and LRM.2 continue to hold. But, as far as LRM.3 and LRM.4 are concerned, nothing can be said. Specifically, we do not know whether E(ζ|X1) = 0 or Var(ζ|X1) = ς²I_n. The first consequence is that the OLS estimator of β1,

b1 = (X1'X1)^{-1}X1'y,

is likely to be biased. Indeed, the bias can easily be derived as follows. Replacing the right hand side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)^{-1}X1'ζ = β1 + (X1'X1)^{-1}X1'X2β2 + (X1'X1)^{-1}X1'ε.
Since RS holds, ε_i, i = 1, ..., n, is conditional-mean independent of (x_{i1}' x_{i2}') and statistically independent of (x_{j1}' x_{j2}'), j ≠ i = 1, ..., n. Therefore, E(ε|X) = 0, which implies that

E(b1|X) = β1 + (X1'X1)^{-1}X1'X2β2

and hence, by the law of iterated expectations, we have the unconditional bias

(4.7.2)   E(b1) − β1 = E[ (X1'X1)^{-1}X1'X2β2 ].
There are two specific instances, however, in which the bias is zero.
The first instance is that analyzed in Greene (2008), when X1'X2 = 0_{k1×k2}. In this case (X1'X1)^{-1}X1'X2β2 = 0 and so the bias in equation (4.7.2) is zero.
The second instance occurs if in the population E(x2'β2|x1) = 0, as I now show. Since in the population E(ε|x) = 0, by the general law of iterated expectations also E(ε|x1) = 0. Hence, E(x2'β2 + ε|x1) = 0, which along with RS yields E(ζ|X1) = 0. Therefore, the ζ vector in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is that b1 is unbiased.
The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 = 0_{k1×k2}. But the converse is not true either, and X1'X2 = 0_{k1×k2} may hold even if E(X2β2|X1) ≠ 0, as shown by the following example.
Example 28. Let y = x1β1 + x2β2 + ε with x1 a Bernoulli random variable:

Pr(x1 = 1) = ρ and Pr(x1 = 0) = 1 − ρ,

and let x2 = 1 − x1. While E(x2β2|x1) = (1 − x1)β2, we have x1x2 = 0 with probability one. In this case, a random sample of y, x1 and x2 from the population will yield X1'X2 = 0 and E(X2β2|X1) ≠ 0 with probability one.
Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case, and in general omitted variables will bring about bias and inconsistency in the coefficient estimates. Solutions are typically given by proxy variables, panel data estimators and instrumental variables estimators. The first method is briefly described below, the classical panel data estimators are pursued in Chapter 8, while IV methods are described in Chapter 10.
To conclude, observe that if relevant variables are omitted, LRM.4 does not generally hold either, unless Var(x2'β2|x1) = ς² < +∞.
Lemma 29. Given any two non-singular square matrices of the same dimension, A and B, if A − B is n.n.d. then B^{-1} − A^{-1} is n.n.d.
The foregoing lemma signifies that, in the space of non-singular square matrices of a given dimension, if A is "no smaller" than B, then A^{-1} is "no greater" than B^{-1}. It is useful in situations in which the difference of inverse matrices is more easily worked out than that of the original matrices.
The following exercise asks you to think through the consequences of overfitting, namely
applying OLS to a statistical model with variables that are redundant in the population model.
Exercise 30. Assume the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied. Assume also that the l×1 vector z of observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also, assume E(ε|x z) = 0 and Var(ε|x z) = σ², i.e. z is redundant in the population equation. Finally, assume there is a sample of size n from the population, such that the elements of the sequence {(y_i x_i' z_i'), i = 1, ..., n} are i.i.d. 1×(1 + k + l) random vectors. Applying the usual notation for the sample variables,
y (n×1) = (y1, ..., yi, ..., yn)',   X (n×k) with i-th row x_i',   Z (n×l) with i-th row z_i',   ε (n×1) = (ε1, ..., εi, ..., εn)',
answer the following questions. 1) Prove that the statistical model

y = Xβ + ε

satisfies LRM.1-LRM.4 (of course, this proves that

(4.7.3)   b = (X'X)^{-1}X'y

is BLUE). 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased estimator of β, and call it b_ofit. 3) Derive the covariance matrix of b_ofit. 4) Use Lemma 29 and verify that, indeed, the conditional covariance matrix of b_ofit is "no smaller" than that of b in (4.7.3). 5) A byproduct of the overfitting strategy is the l×1 vector of OLS coefficients on the variables in Z. Let's call it c. Express c using the intermediate result in the FWL theorem as

c = (Z'Z)^{-1}Z'(y − X b_ofit)

and prove that the overfitting residual vector e_ofit ≡ y − X b_ofit − Zc equals

e_ofit = M[M[Z]X] M[Z] y.

6) Find an unbiased estimator of σ² based on e_ofit.
Answer: 1) Obvious, since in the population and the sampling mechanism we have all we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by noting that from RS and E(ε|x z) = 0 it follows that E(ε|X Z) = 0. 3) Prove that Var(ε|X Z) = σ²I and then prove that

Var[ (X'M[Z]X)^{-1}X'M[Z]y | X Z ] = σ²(X'M[Z]X)^{-1}.

4) Write X'M[Z]X as X'M[Z]X = X'X − X'P[Z]X and then verify that you have all that is needed to invoke the lemma. 5) Easy, it's just algebra: replace b_ofit and c into e_ofit ≡ y − X b_ofit − Zc and rearrange. 6) First, use the formula for the overfitting residual vector derived in the previous question, M[M[Z]X]M[Z]y, to set up the estimator

s² = y'M[Z]M[M[Z]X]M[Z]y / (n − k − l).

Then, notice that

s² = ε'M[Z]M[M[Z]X]M[Z]ε / (n − k − l).

Finally, take the expectation of s², conditional on X and Z, using the trace device and following the same steps as when proving unbiasedness of the standard estimator s². In the derivation don't forget that M[Z] and M[M[Z]X] are orthogonal projections.
4.7.1. The proxy variables solution. Assume for simplicity that there is only one omitted variable, x2, in the population equation

(4.7.4)   y = x1'β1 + x2β2 + ε,

where x1 is a k1×1 vector of observed regressors. Assume also that there is an l×1 vector of observed variables z such that the following assumptions hold.
(1) The z variables are redundant in the population equation, that is, E(y|x z) = x'β.
(2) Once conditioning on z, the omitted variable x2 and the included explanatory variables x1 are independent in conditional mean: E(x2|x1 z) = E(x2|z). Also E(x2|z) = z'θ.
(3) rank{ E[ (x1', z')'(x1' z') ] } = k1 + l. This is analogous to property P.2 in Chapter 1 and permits identification of the coefficients in the proxy variable regression, as we will see below.
Assumption 2 implies that x2 can be written as

(4.7.5)   x2 = z'θ + η,

where η = x2 − E(x2|x1 z), and hence E(η|x1 z) = 0. Replacing the right hand side of equation (4.7.5) into the population equation (4.7.4) yields

(4.7.6)   y = x1'β1 + z'(β2θ) + β2η + ε,

where E(β2η|x1 z) = 0. Also, by the redundancy assumption E(y|x z) = x'β, it follows that E(ε|x z) = 0 and so, by the general LIE,

E(ε|x1 z) = E[ E(ε|x z) | x1 z ] = 0.

It follows that E(β2η + ε|x1 z) = 0 and so, along with P.1 and P.2 (given Assumption 3), also P.3 is satisfied for equation (4.7.6). With the following RS mechanism
RS(x1 z): There is a sample of size n from the population, such that the elements of the sequence {(y_i x_{i1} ... x_{ik1} z_{i1} ... z_{il}), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors,
the resulting statistical model will satisfy LRM.1-LRM.3 and so yield unbiased OLS estimates.
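A hedged Stata sketch of the proxy-variable strategy in the classic wage-equation setting; all variable names (lwage, educ, exper, iq as the proxy for unobserved ability) are hypothetical illustrations, not part of the notes.

. regress lwage educ exper          // ability omitted: the educ coefficient suffers the omitted variable bias
. regress lwage educ exper iq       // iq added as a proxy; unbiased for beta_1 under Assumptions 1-3 above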
4.8. The variance of an OLS individual coefficient
Suppose that attention is centered on a given explanatory variable whose observations are collected into the (n×1) column vector x_i, and that there are k − 1 control variables collected into the n×(k − 1) matrix X_{-i}. Without loss of generality, partition the (n×k) regressor matrix as X = (X_{-i} x_i) and, correspondingly, the (k×1) OLS vector as

b = (b_{-i}', b_i)',
where b_{-i} is (k − 1)×1 and b_i is a scalar. Maintain LRM.1-LRM.4, so that

Var(b_i|X) = σ²(X'X)^{-1}_{ii},

where (X'X)^{-1}_{ii} indicates the last entry on the main diagonal of (X'X)^{-1}.
My purpose here is to derive the formula for (X'X)^{-1}_{ii}. As usual when it comes to the derivation of formulas regarding parts of the OLS vector, I invoke the Frisch-Waugh-Lovell (FWL) Theorem. Hence,

b_i = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]y,

so, given

y = X_{-i}β_{-i} + x_iβ_i + ε,

we have

b_i = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]( X_{-i}β_{-i} + x_iβ_i + ε )

and consequently

b_i = β_i + (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]ε.
Finally,

Var(b_i|X) = E[ (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]εε'M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)^{-1} | X ]
           = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}] E(εε'|X) M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)^{-1}
           = σ²(x_i'M[X_{-i}]x_i)^{-1}
           = σ² / (x_i'M[X_{-i}]x_i),                                   (4.8.1)

which also proves that

(4.8.2)   (X'X)^{-1}_{ii} = 1 / (x_i'M[X_{-i}]x_i).
Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above all, I could get away without referring to the hard-to-remember result on the inverse of a (2×2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in Greene (2008), p. 30).
Exercise 31. Prove (4.8.2) using formula (A-74) for the inverse of a (2×2) partitioned matrix in Greene (2008), p. 966.
4.8.1. The three determinants of Var(b_i|X) when 1 is a regressor. Now I go back to Var(b_i|X) in equation (4.8.1),

Var(b_i|X) = σ²(x_i'M[X_{-i}]x_i)^{-1},

and assume that X_{-i} contains the n×1 unity vector 1, say X_{-i} = (X̃_{-i} 1). Notice, now, that M[X_{-i}]x_i is the residual vector from the OLS regression of x_i on X_{-i}, and so x_i'M[X_{-i}]x_i is the residual sum of squares from this regression. Since the unity vector is a column of X_{-i}, the coefficient of determination of this regression is

R²_i = 1 − x_i'M[X_{-i}]x_i / (x_i'M[1]x_i),

from which we have that

x_i'M[X_{-i}]x_i = (1 − R²_i) x_i'M[1]x_i

and eventually²

Var(b_i|X) = σ² / [ (1 − R²_i) x_i'M[1]x_i ].
Also,

x_i'M[1]x_i = Σ_{j=1}^{n} (x_{ji} − x̄_i)²,

that is, x_i'M[1]x_i is the total variation in x_i around its sample mean x̄_i. Therefore,

(4.8.3)   Var(b_i|X) = σ² / [ (1 − R²_i) Σ_{j=1}^{n} (x_{ji} − x̄_i)² ],

with Var(b_i|X) increasing when
• other things constant, R²_i increases, that is, the correlation between x_i and the other regressors increases (this is the multicollinearity effect on the variance of the OLS individual coefficient);
• other things constant, the total variation in x_i, Σ_{j=1}^{n} (x_{ji} − x̄_i)², decreases;
• other things constant, the regression variance σ² increases.
²An alternative proof is the following. Given Lemma 12, M[X_{-i}] = I − P[1] − P[M[1]X̃_{-i}] and so

Var(b_i|X) = σ² [ x_i'( M[1] − P[M[1]X̃_{-i}] )x_i ]^{-1}

or

Var(b_i|X) = σ² ( x_i'M[1]x_i − x_i'P[M[1]X̃_{-i}]x_i )^{-1}
           = σ² [ x_i'M[1]x_i ( 1 − x_i'P[M[1]X̃_{-i}]x_i / x_i'M[1]x_i ) ]^{-1}
           = σ² [ x_i'M[1]x_i (1 − R²_i) ]^{-1},

where

R²_i ≡ x_i'P[M[1]X̃_{-i}]x_i / x_i'M[1]x_i.

Given (3.8.5), R²_i is the centered R-squared from the regression of x_i on X_{-i} (or, equivalently, the uncentered R-squared from the regression of M[1]x_i on M[1]X̃_{-i}).
Multicollinearity is perfect when x_i belongs to R(X̃_{-i}). In this case R²_i = 1 (see Section 3.7) and the variance of b_i diverges to infinity. Coefficient β_i cannot be estimated with the available data (X is not of f.c.r. in this case).
Remark 32. Multicollinearity, when it does not degenerate into perfect multicollinearity (i.e. det(X'X) = 0), does not affect the finite sample properties of OLS. Nonetheless, it may severely reduce the precision of our estimates, in terms of larger standard errors and confidence intervals.
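In Stata, the multicollinearity component 1/(1 − R²_i) of equation (4.8.3) can be inspected after a regression through the variance inflation factors. A minimal sketch, with hypothetical variable names:

. regress y x1 x2 x3
. estat vif                 // reports 1/(1 - R2_i) for each regressor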
Exercise 33. Partition X as X = (X̃ 1) and, accordingly, the OLS (k×1) vector as b = (b̃', b_0)', where b̃ is of dimension (k − 1)×1 and b_0 is the OLS estimator of the constant term β_0. Prove that

b_0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1)×1 vector of sample means of the X̃ regressors (hint: just use the intermediate result from the proof of the FWL Theorem that b1 = (X1'X1)^{-1}X1'(y − X2b2)).
Exercise 34. Use Exercise 33 and the following three facts:

1) Var(ȳ|X) = E{ [ȳ − E(ȳ|X)]² | X } = E(ε̄²|X) = σ²/n,

2) Var(x̄'b̃|X) = x̄' Var(b̃|X) x̄

and

3) Cov(ȳ, x̄'b̃|X) = E{ (ȳ − E(ȳ|X)) ( x̄'b̃ − E(x̄'b̃|X) )' | X }
                  = E[ (1/n) 1'εε'M[1]X̃ (X̃'M[1]X̃)^{-1} x̄ | X ]
                  = (σ²/n) 1'M[1]X̃ (X̃'M[1]X̃)^{-1} x̄ = 0,

to prove that

Var(b_0|X) = σ²/n + x̄' Var(b̃|X) x̄.
4.9. Residuals from partitioned OLS regressions
Consider the partitioned OLS regression of M[X2]y on the columns of M[X2]X1 as regressors and the corresponding OLS residual vector

u = M[X2]y − M[X2]X1 b1.

I now prove that u is equal to the OLS residual vector e ≡ M[X]y. Since b1 = (X1'M[X2]X1)^{-1}X1'M[X2]y, replacing it into the right hand side of the u equation yields

u = M[X2]y − M[X2]X1 (X1'M[X2]X1)^{-1}X1'M[X2]y = M[X2]y − P[M[X2]X1]y.

By Lemma 12, P[X] = P[X2] + P[M[X2]X1] and so M[X] = M[X2] − P[M[X2]X1], or M[X2] = M[X] + P[M[X2]X1]. Then

u = ( M[X] + P[M[X2]X1] ) y − P[M[X2]X1] y = M[X]y ≡ e.
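A quick numerical check of this result in Stata; y, x1 and x2 are hypothetical variable names and x2 plays the role of the partialled-out block X2.

. regress y x2
. predict ty, residuals            // M[X2]y
. regress x1 x2
. predict tx1, residuals           // M[X2]X1
. regress ty tx1
. predict u, residuals             // residuals from the partitioned regression
. regress y x1 x2
. predict e, residuals             // residuals from the full regression
. summarize u e                    // the two residual series coincide up to numerical rounding, as proved above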
CHAPTER 5
The Oaxaca’s model: OLS, optimal weighted least squares and
group-wise heteroskedasticity
5.1. Introduction
The Oaxaca’s model is a good way to check your comprehension of things so far. The
treatment is more complete than Greene (2008)’s. Importantly, it serves as a motivation of the
Zyskind’s condition, introduced in Section 5.4. It may also serve as an introduction to a number
of topics that will be covered later on: in particular, dummy variables; heteroskedasticity;
generalized least squares estimation; panel data models. Do Exercise 39 at the end.
5.2. Embedding the Oaxaca’s model into a pooled regression framework
We have 2 statistically independent samples, not necessarily of equal sizes: 1) a sample from the population of male workers, with observations on log(wage) collected into the (n_m×1) column vector y_m and socio-demographic explanatory variables collected into the (n_m×k) sample regressor matrix X_m; 2) a sample from the population of female workers, with the same variables collected into the (n_f×1) column vector y_f and the (n_f×k) matrix X_f, respectively.
Assume that both population models

y_m = X_m β_m + ε_m
y_f = X_f β_f + ε_f

meet LRM.1-LRM.5 with regression variances, σ²_m = σ²ω²_m and σ²_f = σ²ω²_f, not necessarily equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two
separate regressions, b_m and b_f, are independently normally distributed, with b_m|X_m ~ N[ β_m, σ²_m(X_m'X_m)^{-1} ] and b_f|X_f ~ N[ β_f, σ²_f(X_f'X_f)^{-1} ], and are both BLUE.
The question I ask is whether it is possible to embed the two models into a single regression model by pooling the two sub-samples into a larger one of size n = n_m + n_f, and continue to estimate β_m and β_f efficiently.
Let’s try and see. Here is the pooled data-set
y =
0
@
ym
yf
1
A , Xw =
0
@
Xm
Xf
1
A , " =
0
@
"m
"f
1
A .
Let 1 denote the (n×1) vector of all unity elements and construct the (n×1) vector d, such that its first n_m entries are all unity and the last n_f are all zero.
Variables like d are usually referred to as dummy variables or indicator variables, since they indicate whether an observation in the sample belongs or not to a given group. In this particular case, d is the male dummy variable indicating whether an observation in the sample is specific to the male group. Since the two groups are mutually exclusive, the female dummy variable, indicating whether an observation in the sample belongs to the female group, can be constructed as the complementary vector 1 − d. By construction, d and 1 − d are orthogonal, that is d'(1 − d) = 0.
Let x_{wi}' be the (1×k) row vector indicating the i-th row of X_w and let y_i, ε_i and d_i be scalars indicating the i-th components of y, ε and d, respectively.
With this in hand, the model for the generic worker i = 1, ..., n is

(5.2.1)   y_i = d_i x_{wi}'β_m + (1 − d_i) x_{wi}'β_f + ε_i.
On setting up the (2k×1) parameter vector β as

β = ( β_m ; β_f )

and the (n×2k) regressor matrix X as

X = ( X_m  0_{(n_m×k)} ; 0_{(n_f×k)}  X_f ),

where 0_{(s×t)} indicates an (s×t) matrix of all zero elements, model (5.2.1) can be reformulated in matrix form as

(5.2.2)   y = Xβ + ε.
Exercise 35. Prove that X has f.c.r. if and only if both Xm and Xf have f.c.r.
Summing up, we have two equivalent representations of the same model: 1) that in Greene (2008), with the two separate regressions; 2) that presented here, with the single regression model (5.2.2). It turns out that the two frameworks are equivalent, as far as efficient estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS estimator b from model (5.2.2) is numerically identical to the OLS estimators from the two separate regressions as presented in Greene (2008), i.e. b = (b_m', b_f')'. Let

b = (X'X)^{-1}X'y.
By construction,

X'y = ( X_m'y_m ; X_f'y_f )

and

X'X = ( X_m'X_m  0_{(k×k)} ; 0_{(k×k)}  X_f'X_f ).

Then, by a well-known property of the inverse of a block-diagonal matrix (see (A-73) in Greene (2008)),

(X'X)^{-1} = ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ).

Hence,

b = ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ) ( X_m'y_m ; X_f'y_f )
  = ( (X_m'X_m)^{-1}X_m'y_m ; (X_f'X_f)^{-1}X_f'y_f )
  = ( b_m ; b_f ).
Exercise 36. Prove that b = (b_m', b_f')' using the FWL Theorem.
It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances ε, although independently distributed, suffer from what is usually referred to as group-wise heteroskedasticity, as the model does not maintain σ²_m = σ²_f. Indeed, the covariance matrix of ε is

Σ = σ² ( ω²_m I_{n_m}  0_{(n_m×n_f)} ; 0_{(n_f×n_m)}  ω²_f I_{n_f} ).

In this sense, model (5.2.2) is not a classical regression model. Does this mean that b is not BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and the Oaxaca's model is one such case. This is verified in the next section. A general necessary and sufficient condition for the OLS estimator to be BLUE is postponed to the last section of this chapter.
5.3. The OLS estimator in the Oaxaca's model is BLUE
Model (5.2.2) can be transformed into a classical regression model by using a standard procedure in econometrics and statistics: weighting. Let

H = ( ω_m^{-1} I_{n_m}  0_{(n_m×n_f)} ; 0_{(n_f×n_m)}  ω_f^{-1} I_{n_f} ).

As stated in the exercise below, the matrix H, when premultiplied to any conformable vector, transforms the vector so that its first n_m elements are divided by ω_m and the remaining n_f by ω_f. This is what we refer to as "weighting".
Exercise 37. Verify by direct inspection that, given any (n_m×1) vector x_m, any (n_f×1) vector x_f and x = (x_m', x_f')', then

Hx = ( ω_m^{-1}x_m ; ω_f^{-1}x_f ).
Premultiply both sides of model (5.2.2) by H:

Hy = HXβ + Hε,

or

(5.3.1)   ỹ = X̃β + ε̃,

where the tilde indicates weighted variables. Two important facts are worth observing at this point. First, the population parameter vector β in the weighted model is the same as in model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to σ²I_n (so, if LRM.5 holds, they are independent normal errors with common variance σ²), since
Var(ε̃|X̃) = HΣH' = HΣH
 = ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} ) ( σ²_m I_{n_m}  0 ; 0  σ²_f I_{n_f} ) ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} )
 = σ² ( ω_m I_{n_m}  0 ; 0  ω_f I_{n_f} ) ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} )
 = σ² I_n.
Therefore, the weighted model is a classical regression model that identifies the parameters of interest and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted model (5.3.1), referred to as the weighted least squares (WLS) estimator, b_w, is BLUE for β.
Let us work out its formula, using Exercise 37:

b_w = ( (ω_m^{-2}X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (ω_f^{-2}X_f'X_f)^{-1} ) ( ω_m^{-2}X_m'y_m ; ω_f^{-2}X_f'y_f )
    = ( ω²_m(X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  ω²_f(X_f'X_f)^{-1} ) ( ω_m^{-2}X_m'y_m ; ω_f^{-2}X_f'y_f )
    = ( (X_m'X_m)^{-1}X_m'y_m ; (X_f'X_f)^{-1}X_f'y_f ),

which proves that b = b_w, namely that in the Oaxaca's model the OLS estimator coincides with the optimal WLS estimator.
Does this imply that we can do inference in the Oaxaca's model by feeding the Stata regress command with the variables of model (5.2.2) without further caution? Not quite. Although the single OLS regression provides the BLUE estimator of the population coefficients β, the OLS estimate of Var(b|X) that would be computed by regress,

Ṽar(b|X) = s² ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ),

with s² obtained from the sum of squares of the pooled residuals, is biased. The reason is that Ṽar(b|X) forces the regression variance estimate to be constant across the two samples. Luckily, the same is not true for the separate regressions on the two subsamples, which provide us with the unbiased estimators of the model coefficients, b_m and b_f, and the unbiased estimator of the covariance matrix

V̂ar(b|X) = ( s²_m(X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  s²_f(X_f'X_f)^{-1} ),

where s²_m = (1/(n_m − k)) Σ_{i=1}^{n_m} e²_i and s²_f = (1/(n_f − k)) Σ_{i=n_m+1}^{n} e²_i. Alternatively, one can implement a feasible version of the weighted regression explained above, using s_m and s_f as weights. But this is clearly more computationally cumbersome than carrying out the two separate regressions.
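A hedged Stata sketch of the two equivalent estimation routes; the variable names (lwage, educ, exper, male) are hypothetical. The pooled fully-interacted regression reproduces b_m and b_f in one run but, as argued above, its reported standard errors impose a common error variance; the separate regressions deliver both the coefficients and valid group-specific standard errors.

. regress lwage educ exper if male == 1            // b_m and s2_m
. regress lwage educ exper if male == 0            // b_f and s2_f
* pooled regression reproducing (b_m, b_f); its reported standard errors are biased here
. generate female  = 1 - male
. generate m_educ  = male*educ
. generate m_exper = male*exper
. generate f_educ  = female*educ
. generate f_exper = female*exper
. regress lwage male m_educ m_exper female f_educ f_exper, noconstant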
5.4. A general result
Zyskind (1967) provides a general necessary and sufficient condition for the OLS estimator to be BLUE.
Theorem 38. Given the regressor matrix X and the conditional covariance matrix Var(ε|X) = Σ, the OLS estimator b = (X'X)^{-1}X'y is BLUE if and only if P[X]Σ = ΣP[X].
If LRM.4 holds, that is Σ = σ²I_n, Zyskind's condition holds for any X, since

P[X]Σ = P[X]I_nσ² = σ²P[X] = ΣP[X].
That Zyskind’s condition is also verified in the Oaxaca’s model is straightforwardly proved
upon elaborating P[X] :
P[X] = X�
X 0X��1
X 0
=
0
@
Xm 0(nm
⇥k)
0
(
nf
⇥k)
Xf
1
A
0
@
(X 0mXm)
�10(k⇥k)
0(k⇥k)
⇣
X 0fXf
⌘�1
1
A
0
@
X 0m 0
(
k⇥nf
)
0(k⇥nm
) X 0f
1
A
=
0
@
Xm (X 0mXm)
�10(n
m
⇥k)
0
(
nf
⇥k)
Xf
⇣
X 0fXf
⌘�1
1
A
0
@
X 0m 0
(
k⇥nf
)
0(k⇥nm
) X 0f
1
A
=
0
B
@
Xm (X 0mXm)
�1X 0m 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
Xf
⇣
X 0fXf
⌘�1X 0
f
1
C
A
=
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
Therefore,
⌃P[X] =
0
B
@
�2mIn
m
0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fInf
1
C
A
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
=
0
B
@
�2mP[X
m
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fP
[
Xf
]
1
C
A
=
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
0
B
@
�2mIn
m
0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fInf
1
C
A
.
= P[X]⌃.
As a final remark, note that the Zyskind condition ensures only that the OLS coefficients are BLUE, saying nothing about the properties of the OLS standard error estimates; indeed, we have seen in the previous sections that the latter may be biased even if b is BLUE. The following exercise on partitioning provides another instance of such an occurrence.
Exercise 39. Consider the partitioned regression

(5.4.1)   y = X1β1 + X2β2 + ε,

maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation by M[X2] boils down to the reduced regression model

(5.4.2)   ỹ = X̃1β1 + ε̃,

where ỹ = M[X2]y, X̃1 = M[X2]X1 and ε̃ = M[X2]ε.
2) How can you interpret the variables in model (5.4.2)? 3) As far as β1 is concerned, does OLS applied to model (5.4.2) yield the same estimator as OLS applied to model (5.4.1)? Why or why not? 4) Does the reduced model (5.4.2) satisfy LRM.1-LRM.4? Which ones, if any, are not satisfied? 5) The degrees of freedom of the reduced regression are n − k1. Do you think that the resulting OLS estimate of σ² would be unbiased? 6) Verify that the reduced model (5.4.2) satisfies the Zyskind condition.
Solution. 1) It does, since M[X2]X2 = 0. 2) The variables ỹ and X̃1 in model (5.4.2) are the residuals from k1 + 1 separate regressions using y and each column of X1 as dependent variables and the columns of X2 as regressors. 3) Yes, by the FWL Theorem. 4) LRM.1-LRM.3 are met, but LRM.4 fails, with

Var(ε̃|X̃1) = σ²M[X2].

5) It is biased, since we know that the unbiased OLS estimator uses n − k degrees of freedom to correct the OLS residual sum of squares (which is nonetheless the same for both models (5.4.1) and (5.4.2), as we learn from Section 4.9). 6) You just have to verify that

M[X2]P[M[X2]X1] = P[M[X2]X1]M[X2],

which is readily done by noting that M[X2], symmetric and idempotent, is the first and the last factor in P[M[X2]X1].
The within regression examined in Chapter 8 (equation (8.2.7)) is a special case of model
(5.4.2) in the foregoing exercise.
CHAPTER 6
Tests for structural change
6.1. The Chow predictive test
There is a time series data-set with T = T1 + T2 observations. X denotes the (T×k) regressor matrix of full column rank, y is the (T×1) vector of observations for the dependent variable and ε is the error vector, which is normally distributed, given X, with zero mean and constant variance σ². Assume also that T1 > k. It is worth noting right now that we are not assuming T2 > k, so that the test we introduce here, differently from the classical Chow test, goes through also in the case in which the second subsample does not permit identification of the (k×1) vector of population coefficients β.
Partition y, X and ε row-wise:

y = ( y1 ; y2 ),   X = ( X1 ; X2 )   and   ε = ( ε1 ; ε2 ),

so that the top blocks of the partitioned matrices contain the first T1 observations and the bottom blocks contain the last T2 observations. The model under the null hypothesis is just the classical regression model

(6.1.1)   y = Xβ + ε,

and the OLS estimator of β is

b* = (X'X)^{-1}X'y.
The structural break is thought of as time-specific additive shocks hitting the model over the second subsample. Therefore, the model in the presence of the structural break is formulated as

y1 = X1β + ε1
(6.1.2)   y2 = X2β + I_{T2}δ + ε2,

where I_{T2} is the identity matrix of order T2 and δ is the (T2×1) vector of time-specific shocks. Written in more compact form, where 0_{(l×m)} indicates an (l×m) null matrix, the general model becomes

( y1 ; y2 ) = ( X1  0_{(T1×T2)} ; X2  I_{T2} ) ( β ; δ ) + ( ε1 ; ε2 ),
from which it is clear that model (6.1.2) uses a larger regressor matrix than model (6.1.1), including also the time dummies specific to the observations in the second subsample,

D = ( 0_{(T1×T2)} ; I_{T2} ),

and that the time-specific shocks δ are the coefficients on those time dummies. The OLS estimators of β and δ can then be expressed as

( b ; c ) = (W'W)^{-1}W'y,

where W = (X D) is the extended regressor matrix. Therefore, the null hypothesis can be formally expressed as H0: δ = 0_{(T2×1)} and can be tested by a standard F-test of joint significance:

(6.1.3)   F = [ (e*'e* − e'e)/T2 ] / [ e'e/(T1 − k) ] ~ F(T2, T1 − k),
where e* = y − Xb* indicates the OLS residual (T×1) vector from the model under the null and

(6.1.4)   e = y − W ( b ; c )

is the OLS residual (T×1) vector from the unrestricted model¹.
It is possible to prove that the F-ratio in (6.1.3) equals

(6.1.5)   F = [ (e*'e* − e1'e1)/T2 ] / [ e1'e1/(T1 − k) ],

where e1 = y1 − X1b1 denotes the OLS residual (T1×1) vector obtained by regressing y1 on X1 and, accordingly,

(6.1.6)   b1 = (X1'X1)^{-1}X1'y1.
Differently from Greene (2008), I prove this result by using the FWL Theorem. We just need to work out e, and to do this I obtain the separate expressions for b and c. Let's get started with b. By the FWL Theorem,

b = (X'M[D]X)^{-1}X'M[D]y.

Expand M[D] to obtain

M[D] = I_T − ( 0_{(T1×T2)} ; I_{T2} ) [ ( 0_{(T1×T2)} ; I_{T2} )'( 0_{(T1×T2)} ; I_{T2} ) ]^{-1} ( 0_{(T1×T2)} ; I_{T2} )'.
The matrix in brackets in the foregoing expression reduces to

0_{(T2×T2)} + I_{T2} = I_{T2}.

¹The degrees-of-freedom correction in the denominator of the F ratio follows from the fact that the number of estimated parameters under the alternative is k + T2.
Therefore,

(6.1.7)   M[D] = I_T − ( 0_{(T1×T2)} ; I_{T2} )( 0_{(T1×T2)} ; I_{T2} )' = I_T − ( 0_{(T1×T1)}  0_{(T1×T2)} ; 0_{(T2×T1)}  I_{T2} ).

The second matrix in equation (6.1.7) transforms any conformable vector it premultiplies so that its first T1 values are replaced by zeroes and its last T2 values are left unchanged. Accordingly, M[D] carries out the complementary operation, transforming any conformable vector to which it is premultiplied in such a way that its first T1 values remain unchanged and its last T2 values are replaced by zeroes. Therefore,

M[D]X = ( X1 ; 0_{(T2×k)} ),

implying that b = b1, as defined in equation (6.1.6).
Turning to c, by the FWL Theorem (equation (3.6.3)),

c = (D'D)^{-1}D'(y − Xb)

and since b = b1 and D'D = I_{T2},

(6.1.8)   c = ( 0_{(T1×T2)} ; I_{T2} )'(y − Xb1) = ( 0_{(T1×T2)} ; I_{T2} )'( y1 − X1b1 ; y2 − X2b1 ) = y2 − X2b1.
Therefore, the OLS coefficients c are the prediction errors made for the second subsample when using the estimates b1 obtained from the first subsample. Finally, replacing the right hand sides of equations (6.1.6) and (6.1.8) into equation (6.1.4) yields

e = y − Xb1 − D(y2 − X2b1) = ( y1 ; y2 ) − ( X1b1 ; X2b1 ) − ( 0_{(T1×1)} ; y2 − X2b1 ) = ( e1 ; 0_{(T2×1)} ),

which proves that e'e = e1'e1 and, consequently, the F test expression of equation (6.1.5).
Remark 40. Since nothing in the foregoing derivation hinges upon the fact that the T2 observations are contiguous in the sample (the OLS estimator and residuals are invariant to permutations of the rows), there is a more general lesson to be learnt here. Regardless of the data-set format, be it a time series, a cross-section or a panel, extending the matrix of regressors with dummy variables that each indicate a single observation will effectively exclude all the involved observations from the estimation sample. Therefore, the above procedure can be used both to test that given observations in the sample are not outliers and, in case of rejection, to exclude the outliers from the estimation sample, without materially removing records from the data-set.
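A hedged Stata sketch of this observation-dummy device; the flagged observation numbers (55 and 56) and the variable names are arbitrary illustrations.

. generate byte d55 = (_n == 55)
. generate byte d56 = (_n == 56)
. regress y x1 x2 d55 d56          // observations 55 and 56 are effectively excluded from the fit
. test d55 d56                     // joint test that the two observations are not outliers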
6.2. An equivalent reformulation of the Chow predictive test
Here we stress the interpretation of the Chow predictive test as a test of zero prediction
errors in the period with insufficient observations. In doing this, we reformulate the test using
the formula based on the Wald statistic.
From equation (6.1.8) derived in the previous section we have
c = y2 − X2b1.
Therefore c, the OLS estimator of δ, can be thought of as the prediction error incurred when we use the first sub-period estimates to predict y in the second sub-period, given X2. If the elements of c are not jointly significantly different from zero, then we have evidence for not rejecting the null hypothesis of parameter constancy (zero idiosyncratic shocks δ). The Chow predictive test in (6.1.5),

F = [ (e*'e* − e1'e1)/T2 ] / [ e1'e1/(T1 − k) ],

assesses exactly this in a rigorous way, since F ~ F(T2, T1 − k) under the null hypothesis and normal ε.
As usual, F can be rewritten as the Wald statistic divided by the number of restrictions (T2 in this case):

(6.2.1)   F = c'[ V̂ar(c|X) ]^{-1} c / T2

(c is the discrepancy vector and V̂ar(c|X) is the OLS estimator of Var(c|X)). The F formula in (6.2.1) can be made operational by elaborating the conditional covariance matrix of the prediction error, Var(c|X), under the null hypothesis of the test, H0: δ = 0.
If δ = 0, then y2 = X2β + ε2 and b1 = β + (X1'X1)^{-1}X1'ε1, and hence

c = X2β + ε2 − X2[ β + (X1'X1)^{-1}X1'ε1 ] = ε2 − X2(X1'X1)^{-1}X1'ε1.

Therefore, under H0,

Var(c|X) = E{ [ε2 − X2(X1'X1)^{-1}X1'ε1][ε2 − X2(X1'X1)^{-1}X1'ε1]' | X }
         = E(ε2ε2'|X) + E[ X2(X1'X1)^{-1}X1'ε1ε1'X1(X1'X1)^{-1}X2' | X ]
         = σ²[ I_{T2} + X2(X1'X1)^{-1}X2' ],
where the second equality follows from the assumption of spherical disturbances over the whole sample,

Var[ ( ε1 ; ε2 ) | X ] = σ²I_T.

Hence,

V̂ar(c|X) = s²_1 [ I_{T2} + X2(X1'X1)^{-1}X2' ],

where s²_1 = e1'e1/(T1 − k). This implies that the predictive F test can also be computed as

F = (y2 − X2b1)'[ I_{T2} + X2(X1'X1)^{-1}X2' ]^{-1}(y2 − X2b1) / (T2 s²_1).
6.3. The classical Chow test
Now both T1 > k and T2 > k, so that there is enough information in both subsamples to identify two subsample-specific (k×1) beta vectors, β1 and β2, and to base a parameter-constancy test on H0: β1 = β2.
As before, X denotes the (T×k) regressor matrix of full column rank, y is the (T×1) vector of observations for the dependent variable and ε is the error vector, which is normally distributed, given X, with zero mean and covariance matrix σ²I_T. The usual row-wise partitioning holds:

y = ( y1 ; y2 ),   X = ( X1 ; X2 )   and   ε = ( ε1 ; ε2 ),

so that the top blocks of the partitioned matrices contain the first T1 observations and the bottom blocks contain the last T2 observations. Assume that both X1 and X2 have f.c.r. Notice that this is not ensured by f.c.r. of X, as the following example shows.
Example 41. The matrix

X = ( 1 2 ; 1 2 ; 1 2 ; 1 0 ; 1 0 )

has rank 2, but

X1 = ( 1 2 ; 1 2 ; 1 2 )

has rank 1.
The general model for the Chow test is

y1 = X1β1 + ε1
y2 = X2β2 + ε2

or, more compactly,

(6.3.1)   y = W ( β1 ; β2 ) + ε,

where

W = ( X1  0_{T1×k} ; 0_{T2×k}  X2 )

and β1 and β2 are (k×1) vectors.
Let

D = ( 0_{T1×T2} ; I_{T2} ).
As demonstrated in Section 6.1, a generic (T×1) vector x, when premultiplied by M[D], is transformed into the interaction of x with the time dummy for the first sub-period and, when premultiplied by P[D], is transformed into the interaction of x with the time dummy for the second sub-period. Hence, the general model (6.3.1) can be equivalently written as

y = M[D]Xβ1 + P[D]Xβ2 + ε,

or, still equivalently, given M[D] = I − P[D],

(6.3.2)   y = Xβ1 + P[D]Xγ + ε,

where γ ≡ β2 − β1. Therefore, the null hypothesis of the Chow test, H0: β1 = β2, is equivalent to the exclusion restrictions H0: γ = 0, and so the test can be implemented by 1) constructing just one set of interacted variables P[D]X and 2) carrying out an F test of joint significance of the coefficients on the interacted variables after OLS estimation of model (6.3.2).
As it turns out, the classical Chow test is a special case of the predictive Chow test. Consider model (6.3.2) and reformulate it by expanding P[D]:

y = Xβ1 + D(D'D)^{-1}D'Xγ + ε

or

y = Xβ + Dδ + ε,

where β = β1 and δ = (D'D)^{-1}D'Xγ. But since (D'D)^{-1}D'X = X2, then δ = X2γ, which shows that δ ∈ R(X2) ⊆ R^{T2} and so that the δ's here are not arbitrary as in the predictive Chow test, where δ ∈ R^{T2}. This implies that, given rank(X2) = k, the two tests are identically equal if and only if T2 = k; only in this case, in fact, is R(X2) = R^{T2}.
CHAPTER 7
Large sample results for OLS and GLS estimators
7.1. Introduction
The linear regression model may present departures from LRM.4, such as heteroskedastic-
ity and/or cluster correlation. In this chapter we study common econometric techniques that
accommodate these issues, for both estimation and inference: primarily, the Generalized LS
(GLS) estimator for the regression coefficients and robust covariance estimators.
All the statistical properties are now derived for n → ∞, and so the techniques we consider in this chapter work well in "large samples".
I spell out the assumptions needed for consistency and asymptotic normality of the OLS and GLS estimators, providing the derivation of the large-sample properties.
Strict exogeneity is maintained throughout:
SE: E(ε|X) = 0.
A weaker version of the random sampling assumption, one which does not maintain identical distributions of records, is invoked when proving asymptotic normality and consistency of the variance estimators:
RS: There is a sample of size n, such that the elements of the sequence {(y_i x_i'), i = 1, ..., n} are independent (NB not necessarily identically distributed) random vectors.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).
7.2. OLS with non-spherical error covariance matrix
I prove consistency in the general case of Var(ε|X) = Σ, where Σ is an arbitrary and unknown symmetric, p.d. matrix.
7.2.1. Consistency. The following assumptions hold.
OLS.1: plim(X'ΣX/n) = lim_{n→∞} E(X'ΣX/n), a positive definite, finite matrix.
OLS.2: plim(X'X/n) = Q, a positive definite, finite matrix.
The proof of consistency goes as follows. We have

b = β + (X'X/n)^{-1}(X'ε/n);

then

plim(b) = β + plim[ (X'X/n)^{-1}(X'ε/n) ] = β + Q^{-1} plim(X'ε/n).

By strict exogeneity,

E(X'ε/n) = 0.

Moreover,

Var(X'ε/n | X) = (1/n) E(X'εε'X/n | X) = (1/n)(X'ΣX/n)

and so

Var(X'ε/n) = (1/n) E(X'ΣX/n),

which goes to zero as n → ∞ by assumption OLS.1. Hence X'ε/n converges in squared mean, and consequently in probability, to zero.
Clearly, the above implies that OLS is consistent in the classical case of LRM.4.
7.2.2. Asymptotic normality with heteroskedasticity. Assumptions OLS.1 and OLS.2 hold along with RS and the following:
OLS.3: Var(ε|X) = Σ, where

Σ = diag( σ²_1, σ²_2, ..., σ²_n ).
OLS.3 permits heteroskedasticity but not correlation. Partition X row-wise:

X = ( x1' ; x2' ; ... ; xn' ).
By strict exogeneity E(x_iε_i) = 0 and hence

Var(x_iε_i) = E[ E(ε²_i x_ix_i' | x_i) ] = σ²_i E(x_ix_i')

and

(1/n) Σ_{i=1}^{n} Var(x_iε_i) = (1/n) Σ_{i=1}^{n} σ²_i E(x_ix_i') = E(X'ΣX/n).
Therefore,

lim_{n→∞} (1/n) Σ_{i=1}^{n} Var(x_iε_i) = lim_{n→∞} E(X'ΣX/n),

which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,

(√n/n) Σ_{i=1}^{n} x_iε_i ≡ X'ε/√n →d N( 0, plim(X'ΣX/n) ).

Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),

√n (b − β) ≡ (X'X/n)^{-1} X'ε/√n →d Q^{-1} X'ε/√n,

and so

√n (b − β) →d N( 0, Q^{-1} plim(X'ΣX/n) Q^{-1} ).
7.2.3. White’s heteroskedasticity consistent estimator for the OLS standard
errors. Under OLS.3 and given the OLS residuals ei = yi�x
0ib, i = 1, ..., n, an heteroskedas-
ticity consistent estimator for the asymptotic covariance matrix of b,
Avar (b) =1
nQ�1plim
✓
X 0⌃X
n
◆
Q�1,
is given by the White’s estimator:
(7.2.1) \Avar (b) =�
X 0X��1
X 0ˆ
⌃X�
X 0X��1
,
where
ˆ
⌃ =
2
6
6
6
6
6
6
6
4
e21 0 · · · 0
0 e22. . . ...
... . . . . . .0
0 · · · 0 e2n
3
7
7
7
7
7
7
7
5
.
An equivalent way to express Σ̂, one that will be used intensively in Chapters 8 and 9, is the following:

Σ̂ = ee' * I_n,

where the symbol * stands for the element-by-element matrix product (also known as the Hadamard product).
Econometric software routinely computes robust OLS standard errors: these are just the square roots of the main diagonal elements of Âvar(b) in (7.2.1). In Stata this is done through the regress option vce(robust) (or, equivalently, simply robust).
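A minimal Stata sketch with hypothetical variable names, contrasting the two covariance estimators:

. regress y x1 x2                  // conventional standard errors, based on s2(X'X)^{-1}
. regress y x1 x2, vce(robust)     // White heteroskedasticity-robust standard errors from (7.2.1)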
7.2.4. White's heteroskedasticity test. The White estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical discrepancy between $s^2(X'X)^{-1}$ and $(X'X)^{-1}X'\hat\Sigma X(X'X)^{-1}$. Under the null hypothesis of homoskedasticity, the discrepancy will be small. This is the essence of White's heteroskedasticity test. The statistic measuring this discrepancy can be computed through the following auxiliary regression, which includes the constant term.

(1) Generate the squared OLS residuals, $e^2 = e \ast e$.
(2) Run the OLS auxiliary regression that uses $e^2$ as the dependent variable and the following regressors: all $k$ variables in the $n\times k$ sample matrix $X = (X_1\ 1)$ and all interaction variables and squared variables in $X_1$. This implies that for any two columns of $X_1$, say variables $x_i$ and $x_j$, there are the additional regressors $x_i \ast x_j$, $x_i \ast x_i$ and $x_j \ast x_j$. The auxiliary regression, therefore, has $p \equiv k(k+1)/2$ regressors.
(3) Save the R-squared of the auxiliary regression, say $R^2_a$, and multiply it by the sample size $n$. The resulting statistic $nR^2_a \sim_A \chi^2(p-1)$ measures the statistical discrepancy between the two covariance estimators and so provides a convenient heteroskedasticity test: reject homoskedasticity when $nR^2_a$ is larger than a conventional percentile of choice for $\chi^2(p-1)$.

We may implement the White test manually, saving the OLS residuals through predict and then generating squares and interactions as appropriate, or more easily by giving the following post-estimation command after regress: imtest, white.
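As an illustration, here is a hedged manual sketch for a regression with two hypothetical regressors x1 and x2 (so $k = 3$ including the constant, $p = 6$ and $p-1 = 5$ degrees of freedom):

regress y x1 x2
predict e, residuals
gen e2   = e^2
gen x1sq = x1^2
gen x2sq = x2^2
gen x1x2 = x1*x2
regress e2 x1 x2 x1sq x2sq x1x2
display "nR2 = " e(N)*e(r2) "   p-value = " chi2tail(5, e(N)*e(r2))

In more recent Stata releases the built-in version of the test is also available as the post-estimation command estat imtest, white.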
7.2.5. Clustering. Clustering of observations along a given dimension is the norm in
microeconometric applications. For example, firms cluster across different sectors, households
live in different provinces, immigrants in a given country belong to different ethnic groups and
so on.
Clustering cannot be neglected in empirical work. In the case of firm data, for example,
it is likely that there is correlation across the productivity shocks hitting firms in the same
sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.
The White estimator can be made robust to cluster correlation quite easily. I explain
this in terms of the firm data example. Assume that we have cross-sectional data of n firms,
indexed by i = 1, ..., n. There are G sectors, indexed by g = 1, ..., G and we know which sector
$g = 1,\dots,G$ firm $i = 1,\dots,n$ belongs to. This information is contained in the $n\times G$ matrix $D$ of sectoral indicators: the element of $D$ in row $i$ and column $j$, say $d(i,j)$, is unity if firm $i$ belongs to sector $j$ and zero if not. The cluster-correlation and heteroskedasticity consistent estimator for the asymptotic covariance matrix of $b$ is then assembled by simply replacing $\hat\Sigma$ in Equation (7.2.1) with
\[
\hat\Sigma_c = ee' \ast DD'.
\]
Stata does this through the regress option vce(cluster clustervar), where clustervar is
the name of the cluster identifier in the data set.
Chapter 9 will cover cases of multi-clustering, that is data that are grouped along more
than one dimension.
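For example, with a hypothetical firm-level data set containing a sector identifier, the cluster-robust standard errors are requested as follows:

regress y x1 x2, vce(cluster sector)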
7.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of the average variance
\[
\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2
\]
is given by
\[
s_n^2 = \frac{1}{n}\sum_{i=1}^n e_i^2,
\]
in the sense that $\operatorname{plim}\left(s_n^2 - \sigma_n^2\right) = 0$ (NB I use this formulation and not $\operatorname{plim}\left(s_n^2\right) = \sigma_n^2$, as $\sigma_n^2$ is a sequence).

Since
\[
s_n^2 = \frac{\varepsilon'\varepsilon}{n} - \left(\frac{\varepsilon'X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right),
\]
\[
\operatorname{plim}\left(s_n^2\right) = \operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right) + 0'Q^{-1}0 = \operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right).
\]
By the RS assumption the squared errors, $\varepsilon_i^2$, are all independently distributed with means $E\left(\varepsilon_i^2\right) \equiv \sigma_i^2$, and given that
\[
\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2,
\]
I can apply Markov's strong law of large numbers to obtain
\[
\operatorname{plim}\left[\left(\frac{\varepsilon'\varepsilon}{n}\right) - \frac{1}{n}\sum_{i=1}^n \sigma_i^2\right] = 0.
\]
7.3. GLS
The estimation strategy described in the previous sections is based on OLS estimates of the regression coefficients with standard error estimates corrected for heteroskedasticity and/or cluster correlation. The drawback of this approach is a loss in efficiency when the departures from LRM.4 are of a known form. We will see that in this case the BLUE can always be found.

To formalize the new set-up, let $\mathrm{Var}(\varepsilon|X) \equiv \Sigma = \sigma^2\Omega$, where $\Omega$ is a known symmetric, positive definite (p.d.) $(n\times n)$ matrix and $\sigma^2$ is an unknown strictly positive scalar (that is, $\Sigma$ is known up to a strictly positive multiplicative scalar).

Since $\Omega$ is symmetric and p.d., it can always be factorized as $\Omega = C\Lambda C'$, where $\Lambda$ is an $(n\times n)$ diagonal matrix with main-diagonal elements all strictly positive and $C$ is an $(n\times n)$ matrix such that $C'C = I$, implying that $C'$ is the inverse of $C$ and consequently that $CC' = I$ ($\Lambda$ and $C$ are called, respectively, the eigenvalue (or characteristic root) and eigenvector (or characteristic vector) matrices of $\Omega$).

A great benefit of the foregoing factorization is that it permits us to compute the inverse of $\Omega$ effortlessly. In fact, it is possible to verify that
\[
\Omega^{-1} = C\Lambda^{-1}C' \tag{7.3.1}
\]
and
\[
\Omega^{-1/2} = C\Lambda^{-1/2}C', \tag{7.3.2}
\]
where $\Omega^{-1}$ is the inverse of $\Omega$, $\Omega^{-1/2}\Omega^{-1/2} = \Omega^{-1}$, $\Lambda^{-1}$ is the inverse of $\Lambda$ and $\Lambda^{-1/2}$ is a diagonal matrix with main-diagonal elements equal to the reciprocals of the square roots of the main-diagonal elements of $\Lambda$.

Consider the GLS transformed model
\[
y^* = X^*\beta + \varepsilon^*, \tag{7.3.3}
\]
such that $y^* \equiv \Lambda^{-1/2}C'y$, $X^* \equiv \Lambda^{-1/2}C'X$ and $\varepsilon^* \equiv \Lambda^{-1/2}C'\varepsilon$.

Exercise 42. Verify by direct inspection that indeed $\Omega^{-1}\Omega = \Omega\Omega^{-1} = I$ and $\Omega^{-1/2}\Omega^{-1/2} = \Omega^{-1}$.

Solution. Remember that $\Lambda$ is diagonal and so $\Lambda^{-1}\Lambda = \Lambda\Lambda^{-1} = I$. Then,
\[
\Omega^{-1}\Omega = C\Lambda^{-1}C'C\Lambda C' = C\Lambda^{-1}I\Lambda C' = C\Lambda^{-1}\Lambda C' = CC' = I
\]
and
\[
\Omega\Omega^{-1} = C\Lambda C'C\Lambda^{-1}C' = C\Lambda I\Lambda^{-1}C' = C\Lambda\Lambda^{-1}C' = CC' = I.
\]
The rest is proved similarly on considering that $\Lambda^{-1/2}$ is diagonal and so $\Lambda^{-1/2}\Lambda^{-1/2} = \Lambda^{-1}$.

Exercise 43. Use (7.3.1) and (7.3.2) to prove 1) $X^{*\prime}X^* = X'\Omega^{-1}X$; 2) $X^{*\prime}\varepsilon^* = X'\Omega^{-1}\varepsilon$; and 3) $\mathrm{Var}(\varepsilon^*|X) = \sigma^2 I_n$; then use the general law of iterated expectation to prove that also $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$.

Given the results of the foregoing exercise, OLS applied to the transformed model (7.3.3) is the Gauss-Markov estimator for $\beta$ and has the formula
\[
b_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^* = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y \tag{7.3.4}
\]
with $\mathrm{Var}(b_{GLS}|X) = \sigma^2\left(X'\Omega^{-1}X\right)^{-1}$.
Exercise 44. Let $\Omega^{-1/2} = C\Lambda^{-1/2}C'$ and prove that
\[
\Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}\varepsilon \tag{7.3.5}
\]
is also a GLS transformation, that is, OLS applied to model (7.3.5) yields $b_{GLS}$.

Solution: By Exercise 42,
\[
\left(X'\Omega^{-1/2}\Omega^{-1/2}X\right)^{-1}X'\Omega^{-1/2}\Omega^{-1/2}y = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y.
\]
The estimator $b_{GLS}$ is OLS applied to a classical regression model and as such it is BLUE. The following exercise asks you to verify by direct inspection that GLS is "better than" OLS in terms of covariance.

Exercise 45. Prove that
\[
\sigma^2\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1} - \sigma^2\left(X'\Omega^{-1}X\right)^{-1}
\]
is a n.n.d. matrix.

Solution: We define a $k\times n$ matrix $D$ as
\[
D \equiv \left(X'X\right)^{-1}X' - \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}.
\]
Therefore,
\[
\left(X'X\right)^{-1}X' = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1} + D.
\]
On noticing that $DX = 0_{k\times k}$,
\[
\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1}
= \left[\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1} + D\right]\Omega\left[\Omega^{-1}X\left(X'\Omega^{-1}X\right)^{-1} + D'\right]
= \left(X'\Omega^{-1}X\right)^{-1} + D\Omega D'.
\]
Since $\Omega$ is p.d., for any $n\times 1$ vector $z$, $z'\Omega z \geq 0$, being equal to zero if and only if $z = 0$. But then, $z'\Omega z \geq 0$ when, in particular, $z = D'w$ for any $n\times 1$ vector $w$, which is equivalent to saying that $w'D\Omega D'w \geq 0$ for any $n\times 1$ vector $w$, or that $D\Omega D'$ is n.n.d., proving the result.
7.3.1. Consistency of GLS. The following assumption holds.

GLS.1: $\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = \lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right) = Q$, a positive definite, finite matrix.

Exercise 46. Given that
\[
b_{GLS} = \beta + \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{n},
\]
prove that $\operatorname{plim}(b_{GLS}) = \beta$ under assumption GLS.1 and strict exogeneity (SE).

Solution. Easy: just write
\[
b_{GLS} = \beta + \left(\frac{X^{*\prime}X^*}{n}\right)^{-1}\frac{X^{*\prime}\varepsilon^*}{n},
\]
then consider that $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$ (see Exercise 43) and, finally, follow the same steps as in Section 7.2.1.
7.3.2. Asymptotic normality. I prove asymptotic normality for $b_{GLS}$ under GLS.1, SE and RS (again, remember that $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$).

By strict exogeneity $E(x_i^*\varepsilon_i^*) = 0$ and hence
\[
\mathrm{Var}(x_i^*\varepsilon_i^*) = \sigma^2 E\left(x_i^* x_i^{*\prime}\right)
\]
and
\[
\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(x_i^*\varepsilon_i^*) = \frac{\sigma^2}{n}\sum_{i=1}^n E\left(x_i^* x_i^{*\prime}\right) = \sigma^2 E\left(\frac{X'\Omega^{-1}X}{n}\right).
\]
Therefore,
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(x_i^*\varepsilon_i^*) = \sigma^2\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right),
\]
which is a finite matrix by assumption. By the Lindeberg-Feller central limit theorem,
\[
\frac{\sqrt n}{n}\sum_{i=1}^n x_i^*\varepsilon_i^* \equiv \frac{X^{*\prime}\varepsilon^*}{\sqrt n} \equiv \frac{X'\Omega^{-1}\varepsilon}{\sqrt n} \to_d N\left(0,\ \sigma^2 Q\right)
\]
and since
\[
\sqrt n\,(b_{GLS}-\beta) \equiv \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} \to_d Q^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n},
\]
then
\[
\sqrt n\,(b_{GLS}-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right).
\]
The asymptotic covariance matrix of $b_{GLS}$ is therefore
\[
\mathrm{Avar}(b_{GLS}) = \frac{\sigma^2}{n}Q^{-1},
\]
and it is estimated by $\widehat{\mathrm{Avar}}(b_{GLS}) = s^2_{GLS}\left(X'\Omega^{-1}X\right)^{-1}$, where
\[
s^2_{GLS} = \frac{(y^* - X^*b_{GLS})'(y^* - X^*b_{GLS})}{n-k} = \frac{(y - Xb_{GLS})'\Omega^{-1}(y - Xb_{GLS})}{n-k}.
\]
Exercise 47. (This may be skipped) Under GLS.1, SE and RS, prove that $\operatorname{plim}\left(s^2_{GLS}\right) = \sigma^2$.
7.3.3. Feasible GLS. In general situations we may know the form of $\Omega$ but not the values taken on by its elements. Therefore, to make GLS operational we need an estimate of $\Omega$, say $\hat\Omega$. Replacing $\Omega$ by $\hat\Omega$ in (7.3.4) delivers the feasible GLS, henceforth FGLS:
\[
b_{FGLS} = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}y.
\]
Since GLS is consistent, knowing that $b_{GLS}$ and $b_{FGLS}$ are asymptotically equivalent, i.e. $\operatorname{plim}(b_{FGLS} - b_{GLS}) = 0$, is enough to ensure that $b_{FGLS}$ is consistent, but not that
\[
\sqrt n\,(b_{FGLS}-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right).
\]
For this we need the stronger condition that $\sqrt n\,(b_{FGLS}-\beta)$ and $\sqrt n\,(b_{GLS}-\beta)$ be asymptotically equivalent, or
\[
\sqrt n\,(b_{FGLS} - b_{GLS}) \to_p 0. \tag{7.3.6}
\]
Two sufficient conditions for (7.3.6) are the following:
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0 \tag{7.3.7}
\]
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0. \tag{7.3.8}
\]
Exercise 48. Use condition GLS.1, that
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = Q,
\]
the similar condition that
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = q_0,
\]
where $q_0$ is a finite vector, and the Slutsky Theorem (if $g$ is a continuous function, then $\operatorname{plim}[g(z)] = g[\operatorname{plim}(z)]$, p. 1113) to verify that
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0
\]
are sufficient for (7.3.6).
Solution: Given
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = Q
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0,
\]
then
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) + \operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = Q,
\]
and so, applying the Slutsky Theorem twice, we get
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n} + \frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = \operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n}\right) = Q
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1} = Q^{-1}. \tag{7.3.9}
\]
By the same token,
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n}\right) = q_0. \tag{7.3.10}
\]
But now, since
\[
\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}\varepsilon,
\qquad
\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}\varepsilon,
\]
$b_{FGLS} - \beta = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}\varepsilon$ and $b_{GLS} - \beta = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}\varepsilon$, then
\[
\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\,(b_{FGLS} - \beta),
\qquad
\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\,(b_{GLS} - \beta).
\]
The last two equalities, along with the maintained conditions (7.3.7) and (7.3.8), the asymptotic results (7.3.9) and (7.3.10) and the Slutsky Theorem, prove that both $\sqrt n\,(b_{GLS} - \beta)$ and $\sqrt n\,(b_{FGLS} - \beta)$ converge in probability to the same limit, $Q^{-1}q_0$.

Conditions (7.3.7) and (7.3.8) must be verified on a case-by-case basis. Importantly, they may hold even in cases in which $\hat\Omega$ is not consistent, as shown in the context of FGLS panel data estimators by Prucha (1984).
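To fix ideas, here is a hedged two-step FGLS sketch for the simple case of group-wise heteroskedasticity, $\mathrm{Var}(\varepsilon_i) = \sigma^2_g$ for observations in group $g$; the variable names (y, x1, x2, group) are hypothetical and the group variances are estimated from first-step OLS residuals:

regress y x1 x2
predict e0, residuals
gen e0sq = e0^2
bysort group: egen sig2hat = mean(e0sq)
regress y x1 x2 [aweight = 1/sig2hat]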
7.4. Large sample tests
7.4.1. Introduction. This section covers large sample tests in more detail than Greene
(2008). For the exam you can skip the derivations of the asymptotic results.
Assume the following results hold:
(1) $\sqrt n\,(b-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right)$;
(2) $\operatorname{plim}\left(\frac{X'X}{n}\right) = Q$;
(3) $\operatorname{plim}\left(s^2\right) = \sigma^2$;
and consider the following lemma, referred to as the product rule. For more on this see White (2001), p. 67 (notice that the product rule is not mentioned in Greene (2008), although it is implicitly used for proving the asymptotic distributions of the tests).

Lemma 49. (The product rule) Let $A_n$ be a sequence of random $(l\times k)$ matrices and $b_n$ a sequence of random $(k\times 1)$ vectors such that $\operatorname{plim}(A_n) = 0$ and $b_n \to_d z$. Then, $\operatorname{plim}(A_n b_n) = 0$.
7.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis $H_o: \beta_k = \beta^o_k$. We begin by noting that under $H_o$
\[
\frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{\sigma^2 Q^{-1}_{kk}}} \to_d N(0,1) \tag{7.4.1}
\]
by result 1 and Theorem D.16(2) in Greene (2008) (p. 1050).

Then, the t-ratio test for $H_o$ is
\[
t = \frac{b_k - \beta^o_k}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}},
\]
where $\left(X'X\right)^{-1}_{kk} \equiv \left(x_k'M_{[X_{(k)}]}x_k\right)^{-1}$ and $X = \left(X_{(k)}\ x_k\right)$ (see Section 4.8). Since $t$ can be reformulated as
\[
t = \frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}},
\]
then
\[
\operatorname{plim}\left[\frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right]
= \operatorname{plim}\left\{\left(\frac{1}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{1}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right)\sqrt n\,(b_k - \beta^o_k)\right\} = 0, \tag{7.4.2}
\]
where the second equality follows from the product rule, given that, by results 2-3 and the Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim converges in probability to zero and, by result 1, the second factor converges in distribution to a normal random scalar. Hence, the two sequences in the plim of equation (7.4.2) are asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given (7.4.1), this proves that
\[
\frac{b_k - \beta^o_k}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}} \to_d N(0,1).
\]
Consider, now, the general case of a null hypothesis $H_o: r'\beta = q$, where $r$ is a non-zero $(k\times 1)$ vector of non-random constants and $q$ is a non-random scalar. Using the same approach as above it is possible to prove that
\[
\frac{r'(b-\beta)}{\sqrt{s^2\, r'\left(X'X\right)^{-1}r}} \to_d N(0,1). \tag{7.4.3}
\]
Exercise 50. (skip) Prove (7.4.3). Hint: by the Slutsky Theorem, $\operatorname{plim}\left[r'\left(\frac{X'X}{n}\right)^{-1}r\right] = r'Q^{-1}r$.
7.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis $H_o: R\beta - q = 0$, where $R$ is a non-random $(J\times k)$ matrix of full-row rank and $q$ is a $(J\times 1)$ column vector. Under $H_o$, $R\beta = q$ and so the F test can be written as
\[
F = \frac{(b-\beta)'R'\left[s^2 R\left(X'X\right)^{-1}R'\right]^{-1}R(b-\beta)}{J}.
\]
The foregoing equation can be rearranged as
\[
JF = \sqrt n\,(b-\beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta).
\]
Now let $A \equiv \sigma^2 RQ^{-1}R'$. Since $A$ is p.d. ($R$ is f.r.r.), there exists a p.d. matrix $A^{1/2}$ such that $A^{1/2}A^{1/2} = A$ and $A^{-1/2} = \left(A^{1/2}\right)^{-1}$. Then, by result 1 and the Slutsky Theorem,
\[
A^{-1/2}R\sqrt n\,(b-\beta) \to_d N(0,\ I_J). \tag{7.4.4}
\]
Similarly, let $\hat A \equiv s^2 R\left(X'X/n\right)^{-1}R'$. Since $\hat A$ is p.d., there exists a p.d. matrix $\hat A^{1/2}$ such that $\hat A^{1/2}\hat A^{1/2} = \hat A$ and $\hat A^{-1/2} = \left(\hat A^{1/2}\right)^{-1}$. Then
\[
\operatorname{plim}\left[\hat A^{-1/2}R\sqrt n\,(b-\beta) - A^{-1/2}R\sqrt n\,(b-\beta)\right]
= \operatorname{plim}\left[\left(\hat A^{-1/2} - A^{-1/2}\right)R\sqrt n\,(b-\beta)\right] = 0, \tag{7.4.5}
\]
where the second equality follows from the product rule given that
\[
\operatorname{plim}\left(\hat A^{-1/2}\right) = A^{-1/2},
\]
by results 2 and 3 and the Slutsky Theorem, and
\[
R\sqrt n\,(b-\beta) \to_d N(0,\ A),
\]
by result 1 and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two sequences in the left-hand-side plim of equation (7.4.5) have the same limiting distribution and, given (7.4.4), this proves that
\[
\hat A^{-1/2}R\sqrt n\,(b-\beta) \to_d N(0,\ I_J).
\]
Let $w \equiv \hat A^{-1/2}R\sqrt n\,(b-\beta)$; then by Theorem D.16(2),
\[
w'w \to_d \chi^2(J). \tag{7.4.6}
\]
But since
\[
\hat A^{-1/2}\hat A^{-1/2} = \hat A^{-1} = \left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1},
\]
then
\[
w'w = \sqrt n\,(b-\beta)'R'\hat A^{-1/2}\hat A^{-1/2}R\sqrt n\,(b-\beta)
= \sqrt n\,(b-\beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta) = JF,
\]
and so by (7.4.6)
\[
JF \to_d \chi^2(J).
\]
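In practice the Wald statistic is rarely computed by hand; after regress, the test command does it. A hedged illustration with hypothetical variables, testing $J = 2$ restrictions jointly:

regress y x1 x2 x3
test (x1 = 0) (x2 = x3)
display "JF = " r(df)*r(F) "   chi2(2) 5% critical value = " invchi2tail(2, 0.05)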
CHAPTER 8
Fixed and Random Effects Panel Data Models
8.1. Introduction
This chapter covers the two most important panel data models: the fixed effect and the
random effect models.
For simplicity we start directly from the statistical models. The sampling mechanism will
be introduced when proving asymptotic normality.
Results in this chapter are demonstrated through the do-file paneldata.do using the data-set airlines.dta, a panel data set that I have extracted from costfn.dta (Baltagi et al. 1998).
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)
Consider the following panel data regression model expressed at the observation level, that is, for individual $i = 1,\dots,N$ and time $t = 1,\dots,T$:
\[
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it} \tag{8.2.1}
\]
where $x_{it}' = (x^1_{it},\dots,x^k_{it})$,
\[
\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}
\]
and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$.
Define $d^j_{it}$ as the value taken on by the dummy variable indicating individual $j = 1,\dots,N$ at observation $(i,t)$, that is
\[
d^j_{it} = \begin{cases} 1 & \text{if } i = j, \text{ any } t = 1,\dots,T\\ 0 & \text{if } i \neq j, \text{ any } t = 1,\dots,T.\end{cases}
\]
Then, model (8.2.1) can be equivalently written as
\[
y_{it} = x_{it}'\beta + d^1_{it}\alpha_1 + \dots + d^i_{it}\alpha_i + \dots + d^N_{it}\alpha_N + \varepsilon_{it}, \tag{8.2.2}
\]
$i = 1,\dots,N$ and $t = 1,\dots,T$.

In a more compact notation, at the individual level $i = 1,\dots,N$, model (8.2.2) is written as
\[
y_i = X_i\beta + d^1_i\alpha_1 + \dots + d^i_i\alpha_i + \dots + d^N_i\alpha_N + \varepsilon_i,
\]
where
\[
\underset{(T\times 1)}{y_i} = \begin{pmatrix} y_{i1}\\ \vdots\\ y_{it}\\ \vdots\\ y_{iT}\end{pmatrix},\qquad
\underset{(T\times k)}{X_i} = \begin{pmatrix} x_{i1}'\\ \vdots\\ x_{it}'\\ \vdots\\ x_{iT}'\end{pmatrix},\qquad
\underset{(T\times 1)}{\varepsilon_i} = \begin{pmatrix} \varepsilon_{i1}\\ \vdots\\ \varepsilon_{it}\\ \vdots\\ \varepsilon_{iT}\end{pmatrix}
\]
and
\[
d^j_i = \begin{cases} 1_T & \text{if } i = j\\ 0_T & \text{if } i \neq j,\end{cases}
\]
$1_T$ indicates the $(T\times 1)$ vector of all unity elements and $0_T$ the $(T\times 1)$ vector of all zero elements.
Stacking data by individuals, an even more compact representation of the regression model (8.2.2), at the level of the whole data-set, is
\[
y = X\beta + D\alpha + \varepsilon, \tag{8.2.3}
\]
where
\[
\underset{(NT\times 1)}{y} = \begin{pmatrix} y_1\\ \vdots\\ y_i\\ \vdots\\ y_N\end{pmatrix},\qquad
\underset{(NT\times k)}{X} = \begin{pmatrix} X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix},\qquad
\underset{(NT\times 1)}{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_i\\ \vdots\\ \varepsilon_N\end{pmatrix},\qquad
\underset{(N\times 1)}{\alpha} = \begin{pmatrix} \alpha_1\\ \vdots\\ \alpha_i\\ \vdots\\ \alpha_N\end{pmatrix}
\]
and $D$ is the $(NT\times N)$ matrix of dummy variables
\[
D = \begin{pmatrix}
1_T & 0_T & \cdots & 0_T\\
0_T & 1_T & \cdots & 0_T\\
\vdots & \vdots & \ddots & \vdots\\
0_T & 0_T & \cdots & 1_T
\end{pmatrix},
\]
or equivalently $D = (d^1\ d^2 \dots\ d^N)$. Under the following assumptions model (8.2.3) is a classical regression model that includes individual dummies:
FE.1: The extended regressor matrix (X D) has f.c.r. Therefore, not only is X of f.c.r.,
but also none of its columns can be expressed as a linear combination of the dummy
variables, which boils down to saying that no column of X can be time-constant,
which in turn implies that X does not include the unity vector (indeed, there is a
constant term in model (8.2.3), but one that jumps across individuals).
FE.2: $E(\varepsilon|X) = 0$. Hence, the variables in $X$ are strictly exogenous with respect to $\varepsilon$, but the statistical relationship with $\alpha$ is left completely unrestricted. Model (8.2.3), therefore, automatically accommodates any form of omitted-variable bias due to the omission of time-constant regressors. Notice that $D$ is taken as a non-random matrix, therefore conditioning on $(X\ D)$ or simply on $X$ is exactly the same.
FE.3: $\mathrm{Var}(\varepsilon|X) = \sigma^2_\varepsilon I_{NT}$. This is standard. It can be relaxed in more advanced treatments of the topic, as in Arellano (2003) for example, but see also Section 8.7 (and Chapter 9, which can however be skipped for the exam).
Exercise 51. Prove that the following model with the constant term is an equivalent reparametrization of Model (8.2.3):
\[
y = 1_{NT}\alpha_0 + X\beta + D_{-1}\tilde\alpha_{-1} + \varepsilon, \tag{8.2.4}
\]
where $D_{-1} = (d^2 \dots d^N)$, $\tilde\alpha_{-1} \equiv \alpha_{-1} - 1_{N-1}\alpha_1$, $\alpha_{-1} = (\alpha_2 \dots \alpha_N)'$, $1_s$ denotes the $s\times 1$ vector of all unity elements and $\alpha_0 \equiv \alpha_1$.

Solution. Partition $D = (d^1\ D_{-1})$ and
\[
\alpha = \begin{pmatrix}\alpha_1\\ \alpha_{-1}\end{pmatrix}.
\]
Then, rewrite model (8.2.3) equivalently as
\[
y = d^1\alpha_1 + X\beta + D_{-1}\alpha_{-1} + \varepsilon. \tag{8.2.5}
\]
Since $D1_N = 1_{NT}$, then $(d^1\ D_{-1})1_N = 1_{NT}$, or equivalently
\[
d^1 + D_{-1}1_{N-1} = 1_{NT}.
\]
Therefore, we can reparametrize model (8.2.5) by adding $1_{NT}\alpha_1$ to and subtracting $(d^1 + D_{-1}1_{N-1})\alpha_1$ from the right-hand side of (8.2.5) to obtain
\[
y = 1_{NT}\alpha_1 + d^1\alpha_1 + X\beta + D_{-1}\alpha_{-1} - (d^1 + D_{-1}1_{N-1})\alpha_1 + \varepsilon
= 1_{NT}\alpha_1 + X\beta + D_{-1}\alpha_{-1} - D_{-1}1_{N-1}\alpha_1 + \varepsilon
= 1_{NT}\alpha_0 + X\beta + D_{-1}\tilde\alpha_{-1} + \varepsilon,
\]
where $\alpha_0 \equiv \alpha_1$ and $\tilde\alpha_{-1} \equiv \alpha_{-1} - 1_{N-1}\alpha_1$.
Remark 52. Exercise 51 demonstrates that after the reparametrization the interpretation of the $\beta$ coefficients is unchanged, the constant term is $\alpha_1$ and the coefficients on the remaining individual dummies are no longer the individual effects of the remaining individuals, $\alpha_i$, $i = 2,\dots,N$, but rather the contrasts of $\alpha_i$ with respect to $\alpha_1$, $i = 2,\dots,N$. Of course, the reference individual need not be the first one in the data-set and can be freely chosen among the $N$ individuals by the researcher at her/his own convenience. In Stata this is implemented by using regress followed by the dependent variable, the $X$ regressors and $N-1$ dummy variables (see paneldata.do).
Remark 53. The interpretation of the constant in Exercise 51 is different from that in
the Stata transformation (see 10/04/12 Exercises) of Model (8.2.3). In the former case the
constant term is the effect of the individual whose dummy is removed from the regression, in
the latter it is the average of the N individual effects.
The LSDV estimator is just the OLS estimator applied to model (8.2.3) and, given FE.1-3, it is the BLUE. The separate formulas of LSDV for $\beta$ and $\alpha$ are obtained by applying the FWL Theorem to (8.2.3). So,
\[
b_{LSDV} = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y
\]
is the LSDV estimator for $\beta$ and
\[
a_{LSDV} = \left(D'D\right)^{-1}D'\left(y - Xb_{LSDV}\right)
\]
is the LSDV estimator for $\alpha$. As already mentioned, both are BLUEs, but while $b_{LSDV}$ converges in probability to $\beta$ when $N\to\infty$ or $T\to\infty$ or both, $a_{LSDV}$ converges in probability to $\alpha$ only when $T\to\infty$. This discrepant large-sample behavior of $b_{LSDV}$ and $a_{LSDV}$ is due to the fact that the dimension of $\alpha$ increases as $N$ increases, whereas that of $\beta$ is kept fixed at $k$.
Exercise 54. Verify that
\[
\left(D'D\right)^{-1} = \begin{pmatrix}
1/T & 0 & \cdots & 0\\
0 & 1/T & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & 0 & \cdots & 1/T
\end{pmatrix}.
\]
Given Exercise 54, $(D'D)^{-1}D' = \frac{1}{T}D'$ and so for a generic $(NT\times 1)$ vector $z$
\[
\left(D'D\right)^{-1}D'z = \begin{pmatrix}
\frac{1}{T}\sum_{t=1}^T z_{1t}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^T z_{it}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^T z_{Nt}
\end{pmatrix} \equiv \begin{pmatrix} \bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\end{pmatrix} \equiv \bar z.
\]
In words, premultiplying any $(NT\times 1)$ vector $z$ by $(D'D)^{-1}D'$ transforms it into an $(N\times 1)$ vector of means, $\bar z$, where each mean is taken over the group of observations peculiar to the same individual and for this reason is called a group mean. Therefore,
\[
a_{LSDV,i} = \bar y_i - \bar x_i' b_{LSDV},
\]
where $\bar x_i'$ is the $(1\times k)$ vector of group means for individual $i$, $\bar x_i' = \left(\bar x^1_i \dots \bar x^k_i\right)$. It is also clear that for any $(NT\times 1)$ vector $z$
\[
P_{[D]}z = D\left(D'D\right)^{-1}D'z = \begin{pmatrix}
\bar z_1\\ \vdots\\ \bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\\ \vdots\\ \bar z_N
\end{pmatrix}.
\]
In words, premultiplying any $(NT\times 1)$ vector $z$ by $P_{[D]}$ transforms it into a sample-conformable $(NT\times 1)$ vector of group means: each group mean is repeated $T$ times. It follows that $M_{[D]}z = z - P_{[D]}z$ is the $(NT\times 1)$ vector of group-mean deviations. Therefore, one can obtain $b_{LSDV}$ by applying OLS to the model transformed in group-mean deviations, that is, regressing $M_{[D]}y$ on $M_{[D]}X$.

Exercise 55. Verify (in a couple of seconds...) that $P_{[D]} = \frac{1}{T}DD'$.

The conditional variance-covariance matrix of $b_{LSDV}$ is $\mathrm{Var}(b_{LSDV}|X) = \sigma^2_\varepsilon\left(X'M_{[D]}X\right)^{-1}$. It is estimated by replacing $\sigma^2_\varepsilon$ with the Anova estimator $s^2_{LSDV}$, based on the LSDV residuals $e_{LSDV} = M_{[D]}y - M_{[D]}Xb_{LSDV}$:
\[
s^2_{LSDV} = \frac{e_{LSDV}'e_{LSDV}}{NT - N - k}. \tag{8.2.6}
\]
Exercise 56. Prove that $E\left(s^2_{LSDV}\right) = \sigma^2_\varepsilon$. This is a long one, but when done you can tell yourself "BRAVO!" I just give you a few hints. First, on noting that $y$ is determined by the right-hand side of (8.2.3), prove that $e = M_{[M_{[D]}X]}M_{[D]}\varepsilon$; then elaborate the conditional mean of $\varepsilon'M_{[D]}M_{[M_{[D]}X]}M_{[D]}\varepsilon$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.

It is not hard to verify (do it) that $b_{LSDV}$ can be obtained from the OLS regression of model (8.2.3) transformed in group-mean deviations (this transformation is referred to in the panel-data literature as the within transformation):
\[
M_{[D]}y = M_{[D]}X\beta + M_{[D]}\varepsilon. \tag{8.2.7}
\]
The intuition is simple: since the group mean of any time-constant element, such as $\alpha_i$, coincides with the element itself, all time-constant elements in model (8.2.3) are wiped out; this also explains why $X$ cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (8.2.3), making it look almost like a classical LRM. In particular, it can be proved easily that LRM.1-LRM.3 hold. Notice, however, that the errors in the transformed model, $M_{[D]}\varepsilon$, have a non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular; can you derive it?). Specifically, the vector $M_{[D]}\varepsilon$ presents within-group serial correlation, since for each individual group there are only $T-1$ linearly independent demeaned errors. As a consequence, LRM.4 does not apply to model (8.2.7). All the same, $b_{LSDV}$ is BLUE. This is true because the condition of Theorem 38 in Section 5.4 is met (if you have answered the previous question on the covariance matrix of $M_{[D]}\varepsilon$, you should be able to verify this claim as well).

One should not conclude from the foregoing discussion that OLS on the within-transformed model (8.2.7) is a safe strategy. As in the Oaxaca pooled model of Section 5.2, the fact that the error covariance matrix is not spherical, presenting in this specific case within-group serial correlation, has bad consequences as far as standard error estimates are concerned. Indeed, should we leave the econometric software free to treat model (8.2.7) as a classical LRM, and so regress $M_{[D]}y$ on $M_{[D]}X$, it would compute the coefficient estimates just fine, but it would estimate $\mathrm{Var}(b_{LSDV}|X)$ by $s^2\left(X'M_{[D]}X\right)^{-1}$, with $s^2 = e_{LSDV}'e_{LSDV}/(NT-k) \neq s^2_{LSDV}$, which is biased since it uses the wrong degrees-of-freedom correction. The econometric software cannot be aware that for each individual in the sample there are only $T-1$ linearly independent demeaned errors and so, rather than dividing the residual sum of squares by $N(T-1)-k$, it divides it by $NT-k$. The upshot is that standard errors estimated in this way need rectifying by multiplying each of them by the correction factor $\sqrt{(NT-k)/(NT-N-k)}$.
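A minimal Stata sketch of this manual within regression (hypothetical variable names y, x1 and individual identifier id; xtreg, fe applies the correct degrees of freedom automatically, so this is only for illustration):

bysort id: egen ybar  = mean(y)
bysort id: egen x1bar = mean(x1)
gen ydm  = y  - ybar
gen x1dm = x1 - x1bar
regress ydm x1dm, noconstant
* multiply each reported standard error by sqrt((NT-k)/(NT-N-k)) as explained above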
An interesting hypothesis to test is that of the absence of individual heterogeneity, $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_N$. Under the restriction implied by $H_0$, model (8.2.3) pools together all data with no attention to the individual clustering and can be written as
\[
y = X^*\beta^* + \varepsilon, \tag{8.2.8}
\]
where
\[
X^* = (1_{NT}\ X),\qquad \beta^* = \begin{pmatrix}\alpha_0\\ \beta\end{pmatrix}.
\]
Hence, under $H_0$, the pooled OLS (POLS) estimator
\[
b^*_{POLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y \tag{8.2.9}
\]
is the BLUE. Let $e_{POLS}$ indicate the restricted residual vector
\[
e_{POLS} = y - X^*b^*_{POLS}; \tag{8.2.10}
\]
then under normality and $H_0$
\[
F = \frac{\left(e_{POLS}'e_{POLS} - e_{LSDV}'e_{LSDV}\right)/(N-1)}{e_{LSDV}'e_{LSDV}/(NT-N-k)} \sim F(N-1,\ NT-N-k). \tag{8.2.11}
\]
If $F$ does not reject $H_0$, POLS is a legitimate estimation procedure, more efficient than LSDV. If $F$ rejects $H_0$, then POLS is biased and LSDV should be adopted.
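A hedged sketch of how (8.2.11) can be computed in Stata from the two residual sums of squares (hypothetical variable names; xtreg, fe reports the same statistic as the "F test that all u_i=0"):

quietly tabulate id, gen(id_)
scalar Ng = r(r)
regress y x1 x2
scalar rss_p = e(rss)
regress y x1 x2 id_*, noconstant
scalar rss_u = e(rss)
scalar F = ((rss_p - rss_u)/(Ng - 1)) / (rss_u/e(df_r))
display F, Ftail(Ng - 1, e(df_r), F)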
Exercise 57. On reparametrizing the LSDV model as in Exercise 51, the hypothesis of no individual heterogeneity becomes $H_0: \tilde\alpha_{-1} = 0$. Prove that the resulting F-test is numerically identical to $F$ in Equation (8.2.11).

Solution. Easy. Since models (8.2.3) and (8.2.4) are indeed the same model, the resulting F-test is numerically identical to the F-test in Equation (8.2.11). This is demonstrated empirically in the paneldata.do Stata do-file.
8.3. The Random Effect Model
The random effect model has the same algebraic structure as model (8.2.1). At the observation level, $i = 1,\dots,N$ and $t = 1,\dots,T$, we have
\[
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it} \tag{8.3.1}
\]
where $x_{it}' = (x^1_{it},\dots,x^k_{it})$,
\[
\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}
\]
and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$. The statistical properties of model (8.3.1) are different, though. Without loss of generality, write $\alpha_i$ as $\alpha_i = \alpha_0 + u_i$ and let
\[
\underset{(N\times 1)}{u} = \begin{pmatrix} u_1\\ \vdots\\ u_i\\ \vdots\\ u_N\end{pmatrix}.
\]
Model (8.3.1) can then be written compactly as
\[
y = X^*\beta^* + w, \tag{8.3.2}
\]
where
\[
X^* = (1_{NT}\ X),\qquad \beta^* = \begin{pmatrix}\alpha_0\\ \beta\end{pmatrix}
\]
and $w = \varepsilon + Du$.
The following is maintained.

RE.1: $X^*$ has f.c.r. This is equivalent to 1) $X$ of f.c.r. and 2) no linear combination of the columns of $X$ being equal to $1_{NT}$. Hence, as long as these two requirements are met, $X$ can contain time-constant variables.

RE.2: $E(\varepsilon|X^*) = 0$ and $E(u|X^*) = 0$. This maintains strict exogeneity of $X$ with respect to both components of $w$, and so with respect to $w$ itself. It is a stringent assumption, implying that the time-constant variables that are not included in the regression are unrelated to the included regressors $X$. Notice that since $1_{NT}$ is non-random, conditioning on $X^*$ is the same as conditioning on $X$.

RE.3: $\mathrm{Var}(\varepsilon|X^*) = \sigma^2_\varepsilon I_{NT}$, $\mathrm{Var}(u|X^*) = \sigma^2_u I_N$, $\mathrm{Cov}(\varepsilon,u|X^*) = E(\varepsilon u'|X^*) = 0_{(NT\times N)}$.

Let $\Sigma \equiv \mathrm{Var}(w|X^*)$. Then, given RE.3,
\[
\Sigma = \mathrm{Var}(\varepsilon|X^*) + \mathrm{Var}(Du|X^*) = \sigma^2_\varepsilon I_{NT} + \sigma^2_u DD'.
\]
This means that under RE.1-3 the covariance matrix of $w$, although homoskedastic along the main diagonal, is non-diagonal, and the POLS estimator in (8.2.9) is unbiased (verify this) but not BLUE (unless $\sigma^2_u = 0$). The BLUE estimator for $\beta^*$ is therefore the GLS random effect estimator
\[
b^*_{GLS-RE} = \left(X^{*\prime}\Sigma^{-1}X^*\right)^{-1}X^{*\prime}\Sigma^{-1}y.
\]
For the implementation of $b^*_{GLS-RE}$ we need to work out $\Sigma^{-1}$.

Exercise 58. Verify that $w$ is homoskedastic and in particular that $\mathrm{Var}(w_{it}|X^*) = \sigma^2_\varepsilon + \sigma^2_u$ for all $i = 1,\dots,N$ and $t = 1,\dots,T$.

Since (see Exercise 55)
\[
P_{[D]} = \frac{1}{T}DD',
\]
then
\[
\Sigma = \sigma^2_\varepsilon I_{NT} + T\sigma^2_u P_{[D]} = \sigma^2_\varepsilon I_{NT} - \sigma^2_\varepsilon P_{[D]} + \sigma^2_\varepsilon P_{[D]} + T\sigma^2_u P_{[D]} = \sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]},
\]
where $\sigma^2_1 = \sigma^2_\varepsilon + T\sigma^2_u$. Therefore,
\[
\Sigma^{-1} = \frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]} \tag{8.3.3}
\]
and
\[
b^*_{GLS-RE} = \left[X^{*\prime}\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right)y.
\]
Exercise 59. Verify that $\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}$ is indeed the inverse of $\Sigma$, that is,
\[
\left(\sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]}\right)\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right) = I_{NT}
\]
(easy, if you remember the properties of $M_{[D]}$ and $P_{[D]}$).

Exercise 60. 1) Verify that $b^*_{GLS-RE}$ can also be written as
\[
b^*_{GLS-RE} = \left[X^{*\prime}\left(M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}P_{[D]}\right)y. \tag{8.3.4}
\]
2) Verify that premultiplying all variables of model (8.3.2) by $M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ transforms it into a classical regression model, so that $b^*_{GLS-RE}$ can be obtained at once by applying OLS to the transformed model. 3) Verify that the operator $M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ can also be written as
\[
M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]} = I_{NT} - \left(1 - \frac{\sigma_\varepsilon}{\sigma_1}\right)P_{[D]}. \tag{8.3.5}
\]
The operator in (8.3.5), $M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}$, transforms any conformable variable it premultiplies into quasi-mean deviations, or partial deviations, in the sense that it removes only a portion of the group mean from the variable. For this reason, the coefficients on time-constant variables are identified in the RE model: time-constant variables, when premultiplied by $M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}$, are not wiped out, but rescaled by the factor $\sigma_\varepsilon/\sigma_1$. The RE model under the GLS transformation is therefore
\[
\left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]y = \left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]X^*\beta^* + \left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w \tag{8.3.6}
\]
and you may wish to verify that indeed
\[
\mathrm{Var}\left(\left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w \mid X^*\right) = \sigma^2_\varepsilon I_{NT}.
\]
8.3.1. The Feasible GLS. The feasible version of $b^*_{GLS-RE}$, say $b^*_{FGLS-RE}$, the one that is actually implemented in econometric software, can be obtained through the method of Swamy and Arora (1972). The estimator for $\sigma^2_\varepsilon$ is simply $s^2_{LSDV}$ in (8.2.6), and that for $\sigma^2_1$ is obtained as follows.

Define the Between residual vector $e_B$ as
\[
e_B = P_{[D]}y - P_{[D]}X^*b^*_B, \tag{8.3.7}
\]
where $b^*_B = \left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}y$. In words, $e_B$ is the residual vector from the OLS regression of the group means of $y$ on the group means of $X^*$. The resulting estimator, $b^*_B$, is referred to in the panel data literature as the Between estimator.¹ Then, based on $e_B$, construct the Anova estimator for $\sigma^2_1$ as
\[
s^2_B = \frac{e_B'e_B}{N - k - 1}.
\]
¹Technical note: I maintain that no column of $X$ is either time-constant or already in group-mean deviations, so that both $b_{LSDV}$ and $b^*_B$ are uniquely defined (in fact, with such an assumption $X^{*\prime}P_{[D]}X^*$ and $X'M_{[D]}X$ are both non-singular). Indeed, this is made only for simplicity, since it is possible to prove that $s^2_B$ and $s^2_{LSDV}$ are uniquely defined even if $b_{LSDV}$ and $b^*_B$ are not. The proof requires that all inverse matrices in the residual formulas be replaced with generalized inverse matrices. But don't worry, I won't pursue it further.
Exercise 61. Prove that
\[
E\left(s^2_B\right) = \sigma^2_\varepsilon + T\sigma^2_u.
\]
Same hint as for Exercise 56: first, on noting that $y$ is determined by the right-hand side of (8.3.2), prove that $e_B = M_{[P_{[D]}X^*]}P_{[D]}w$; then elaborate the conditional mean of $w'P_{[D]}M_{[P_{[D]}X^*]}P_{[D]}w$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.
Solution. Replacing the formula of $b^*_B$ into the right-hand side of equation (8.3.7) gives
\[
e_B = \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]P_{[D]}y
= \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]\left(P_{[D]}X^*\beta^* + P_{[D]}w\right)
= \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]P_{[D]}w
= M_{[P_{[D]}X^*]}P_{[D]}w.
\]
Therefore,
\[
e_B'e_B = w'P_{[D]}M_{[P_{[D]}X^*]}P_{[D]}w = w'M_{[P_{[D]}X^*]}P_{[D]}w,
\]
where the first equality follows from the idempotence of $M_{[P_{[D]}X^*]}$ and the second from
\[
P_{[D]}M_{[P_{[D]}X^*]} = M_{[P_{[D]}X^*]}P_{[D]}
\]
and the idempotence of $P_{[D]}$.
Upon noticing that $e_B'e_B$ is a scalar,
\[
e_B'e_B = \operatorname{tr}\left(w'M_{[P_{[D]}X^*]}P_{[D]}w\right) = \operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\right),
\]
and so
\[
E\left(e_B'e_B\mid X^*\right) = E\left[\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\right)\mid X^*\right]
= \operatorname{tr}\left[E\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\mid X^*\right)\right]
= \operatorname{tr}\left[M_{[P_{[D]}X^*]}P_{[D]}E\left(ww'\mid X^*\right)\right]
= \operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\Sigma\right).
\]
Expressing $\Sigma$ in spectral decomposition,
\[
\Sigma = \sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]},
\]
yields $P_{[D]}\Sigma = \sigma^2_1 P_{[D]}$, given that $P_{[D]}$ is idempotent and $P_{[D]}M_{[D]} = 0_{(NT\times NT)}$. Hence,
\[
E\left(e_B'e_B\mid X^*\right) = \sigma^2_1\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right).
\]
Then, it remains to prove that $\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right) = N - k - 1$. Since
\[
M_{[P_{[D]}X^*]}P_{[D]} = P_{[D]} - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]},
\]
\[
\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right) = \operatorname{tr}\left(P_{[D]}\right) - \operatorname{tr}\left[P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right] = \operatorname{tr}(I_N) - \operatorname{tr}(I_{k+1}) = N - k - 1.
\]
In conclusion, the Feasible GLS, $b^*_{FGLS-RE}$, is
\[
b^*_{FGLS-RE} = \left[X^{*\prime}\left(M_{[D]} + \frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(M_{[D]} + \frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)y.
\]
Exercise 62. Prove that
\[
E\left(\frac{e_{LSDV}'e_{LSDV}}{NT-N-k}\ \middle|\ X\right) = \sigma^2_\varepsilon
\]
(hint: follow the same steps as above, noticing that $M_{[D]}w = M_{[D]}\varepsilon$).
Exercise 63. Derive the formula for the subvector of $b^*_{GLS-RE}$, say $b_{GLS-RE}$, estimating the $\beta$ vector.

Solution. Simply apply the FWL Theorem to the GLS-transformed RE model in (8.3.6), noticing that by the well-known properties of orthogonal projectors $P_{[D]}P_{[1_{NT}]} = P_{[1_{NT}]}$ (remember that $1_{NT} = D1_N$ and so $1_{NT} \in R(D)$), so that
\[
\left(M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right)M_{[1_{NT}]}\left(M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right) = M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)
\]
and eventually
\[
b_{GLS-RE} = \left\{X'\left[M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)\right]X\right\}^{-1}X'\left[M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)\right]y.
\]
8.4. Stata implementation of standard panel data estimators
Both fixed effects and random effects estimators are implemented through the Stata com-
mand xtreg, with the usual Stata syntax for regression commands: the command is followed
by the name of the dependent variable and then the list of regressors. The noconstant option
is not admitted in this case.
As a preliminary step, however, a panel data declaration is needed to make Stata aware
of which variables in our data identify time and individuals. Suppose that in our data the
individual variable is named id and the time variable time, then the panel data declaration is
carried out by the instruction
xtset id time
The random effect estimator is the default of xtreg, while the fixed effects (LSDV) esti-
mator requires the option fe.
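For instance, with hypothetical variables y, x1, x2 and panel identifiers id and time:

xtset id time
xtreg y x1 x2, fe    // fixed effects (LSDV)
xtreg y x1 x2, re    // random effects (Swamy-Arora FGLS)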
Sometimes you may find it convenient to implement the FE and RE estimators by hand, using regress rather than xtreg. The greater computational effort may pay for the simple reason that regress, being the most popular estimation command in Stata, is updated more frequently to accommodate the most recent developments in statistics and econometrics, and so typically has more options than any other estimation command in Stata. To implement $b_{LSDV}$ and $a_{LSDV}$ at once you may just apply regress to the LSDV model (8.2.3). This requires generating a full set of individual dummies from the individual identifier id in your panel. This is done through the tabulate command with an option, as follows

tabulate id, gen(id_)

where id_ is a name of choice. If N equals, say, 100, tabulate will add the full set of 100 individual dummies to your data, with names id_1, id_2, ..., id_100, and you can just treat them as regressors in a regress instruction to get $b_{LSDV}$ as the coefficient estimates on the X variables and $a_{LSDV}$ as the coefficient estimates on the id_1-id_100 variables. Degrees of freedom are correctly calculated as $NT - N - k$ and so no correction of standard errors is needed. Notice that if you include all 100 dummies, then the constant term should be removed by the noconstant option. Alternatively, you can leave it there and include $N-1$ dummies. While the $b_{LSDV}$ estimates remain unchanged, the coefficient estimates on the included dummies do not. The latter must now be thought of as contrasts with respect to the constant estimate, which turns out to equal the individual effect estimate peculiar to the individual excluded from the regression, who is therefore treated as the base individual. Nothing is lost by choosing either identification strategy.
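A minimal sketch of the two equivalent identification strategies just described (variable names, and N = 100, are hypothetical):

tabulate id, gen(id_)
regress y x1 x2 id_*, noconstant     // all N dummies, no constant: coefficients on id_* are a_LSDV
regress y x1 x2 id_2-id_100          // constant plus N-1 dummies: dummy coefficients are contrasts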
When N is large the foregoing regress strategy is not practical. The $b_{LSDV}$ estimator can then be implemented manually by applying the within transformation, carrying out OLS on the transformed model and then correcting the standard errors appropriately. Implementation of $b^*_{FGLS-RE}$ by hand is trickier and proceeds along the following steps: 1) get the two variance-component estimates from the within and between regressions; 2) transform the variables (including the constant) in partial deviations; and 3) apply OLS to the transformed variables. Details can be found in a Stata do-file available on the learning space.
I recommend always using the official xtreg command to implement the standard panel data estimators in empirical applications, unless it is strictly necessary to do otherwise (for example, if I explicitly ask you to!).
8.5. Testing fixed effects against random effects models
As Hausman (1978) and Mundlak (1978) independently found (in two papers that appeared in the same Econometrica issue!), the RE model is a special case of the FE model. In fact, while in the former model assumption RE.2 models the relationship between the random individual components, $u$, and $X$ ($E(u|X) = 0$), the latter leaves it completely unrestricted. As a consequence, the RE model is nested within the FE model, so that a test discriminating between them can be easily implemented with $E(u|X) = 0$ as the null hypothesis.
I present here two popular tests that, moving from the foregoing consideration, can provide
some guidance in the choice between RE and FE models.
8.5.1. The Hausman test. Under $H_o: E(u|X) = 0$, both the LSDV and FGLS-RE estimators are consistent for $N\to\infty$, but the LSDV estimator is inefficient: redundant individual effects are included in the regression when they could instead have been treated as random disturbances, saving degrees of freedom. On the other hand, if $H_o$ is not true the LSDV estimator remains consistent, but FGLS does not, suffering an omitted-variable bias. The basic idea of the Hausman test (Hausman, 1978), therefore, is that under $H_o$ the statistical difference between the two estimators should not be significantly different from zero in large samples.
Hausman proves that, under RE.1-RE.3, such difference can be measured by the statistic
\[
H = (b_{LSDV} - b_{FGLS-RE})'\,\widehat{\mathrm{Avar}}(b_{LSDV} - b_{FGLS-RE})^{-1}(b_{LSDV} - b_{FGLS-RE})
\]
and also that
\[
H \to_d \chi^2(k).
\]
Hausman also provides a useful computational result. He shows that since $b_{FGLS-RE}$ is asymptotically efficient and $b_{LSDV}$ is inefficient under the null,
\[
\mathrm{Acov}(b_{LSDV} - b_{FGLS-RE},\ b_{FGLS-RE}) = 0,
\]
so
\[
\mathrm{Acov}(b_{LSDV} - b_{FGLS-RE},\ b_{FGLS-RE}) = \mathrm{Acov}(b_{LSDV},\ b_{FGLS-RE}) - \mathrm{Avar}(b_{FGLS-RE}) = 0
\]
and
\[
\mathrm{Avar}(b_{LSDV} - b_{FGLS-RE}) = \mathrm{Avar}(b_{LSDV}) - \mathrm{Avar}(b_{FGLS-RE}).
\]
Hence,
\[
H = (b_{LSDV} - b_{FGLS-RE})'\left[\widehat{\mathrm{Avar}}(b_{LSDV}) - \widehat{\mathrm{Avar}}(b_{FGLS-RE})\right]^{-1}(b_{LSDV} - b_{FGLS-RE}).
\]
Wooldridge (2010) (pp. 328-334) points out two difficulties with the Hausman test.

First, $\mathrm{Avar}(b_{LSDV}) - \mathrm{Avar}(b_{FGLS-RE})$ is singular if $X$ includes aggregate variables, such as time dummies. Therefore, along with the coefficients on time-constant variables, also those on aggregate variables must be excluded from the Hausman statistic.
Second, and more importantly, if RE.3 fails, then, on the one hand, the asymptotic distribution of $H$ is not standard even if RE.2 holds, so that $H$ would be of little guidance in detecting violations of RE.2, with an actual size that may be significantly different from the nominal size. On the other hand, $H$ is designed to detect violations of RE.2 and not of RE.3. In fact, if RE.2 holds both LSDV and FGLS-RE are consistent, regardless of RE.3, and $H$ converges in distribution rather than diverging, which means that the probability of rejecting RE.3 when it is false does not tend to unity as $N\to\infty$, making $H$ inconsistent. The solution is therefore to consider $H$ as a test of RE.2 only, but in a version that is robust to violations of RE.3. The approach I describe next is well suited to solve both difficulties at once.
8.5.2. The Mundlak test. Mundlak (1978) asks the following question. Is it possible to find an estimator that is more efficient than LSDV within a framework that allows correlation between the individual effects, taken as random variables, and $X$? To provide an answer, he starts from model (8.2.3) and supposes that the individual effects are linearly related to the regressors according to the following equation
\[
\alpha = 1_N\pi_0 + \left(D'D\right)^{-1}D'X\pi + u,
\]
with $E(\alpha|X) = E\left[\alpha\mid\left(D'D\right)^{-1}D'X\right]$, and so $E(u|X) = 0$. Pre-multiplying both sides of the foregoing equation by $D$ and then replacing the right-hand side of the resulting equation into (8.2.3) yields
\[
y = 1_{NT}\pi_0 + X\beta + P_{[D]}X\pi + Du + \varepsilon, \tag{8.5.1}
\]
which is evidently a RE model extended to include the $P_{[D]}X$ regressors. Model (8.5.1) springs from a restriction in (8.2.3) and hence seems promising for more efficient estimates. But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (8.5.1) returns the LSDV estimator: $b_{LSDV}$ for the $\beta$ coefficients, $b_B - b_{LSDV}$ for the $\pi$ coefficients and $b_{0B}$ for the constant term $\pi_0$, where $b_{0B}$ and $b_B$ are the components of the Between estimator, $b^*_B$, presented in Section 8.3.1.
To summarize Mundlak's results:
• The standard LSDV estimator for $\beta$ in the FE model (equation (8.2.3)) is the FGLS-RE estimator for $\beta$ in the general RE model (8.5.1).
• The standard FGLS-RE estimator in the RE model (equation (8.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model (8.5.1) with the constraints $\pi = 0$.
Therefore, the validity of the RE model can be tested by applying a standard Wald test of joint significance for the null hypothesis that $\pi = 0$ in the context of Mundlak's equation (8.5.1):
\[
M = (b_{LSDV} - b_B)'\,\widehat{\mathrm{Avar}}(b_{LSDV} - b_B)^{-1}(b_{LSDV} - b_B).
\]
Under $H_0: \pi = 0$, $M \to_d \chi^2(k)$.

Hausman and Taylor (1981) prove that the statistics $H$ and $M$ are numerically identical (for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial correlation) using the standard robustness options available for regression commands in most econometric packages. In addition, it is relatively easy to detect and solve singularity problems in the context of regression-based tests.
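A minimal sketch of the regression-based (Mundlak) version, made robust to violations of RE.3 by clustering at the individual level; variable names are hypothetical and one group mean is added for every time-varying regressor:

bysort id: egen x1bar = mean(x1)
bysort id: egen x2bar = mean(x2)
xtreg y x1 x2 x1bar x2bar, re vce(cluster id)
test x1bar x2bar        // H0: pi = 0, i.e. the RE specification is adequate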
8.5.3. Stata implementation. The Stata implementation of most results in this section
is demonstrated through a Stata do file available on the course learning space.
8.6. Large-sample results for the LSDV estimator
8.6.1. Introduction. This section proves consistency and asymptotic normality of the
LSDV estimator, then describes the heteroskedasticity and within-group serial correlation
consistent covariance estimator and finally provides a remark for practitioners.
Notation is standard. $X$ denotes the $(NT\times k)$ regressor matrix (of all time-varying regressors) and is partitioned by stacking individuals
\[
X = \begin{pmatrix} X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix}, \tag{8.6.1}
\]
with $X_i$ indicating the $(T\times k)$ block of observations peculiar to individual $i = 1,\dots,N$. Similarly, observations in the $(NT\times 1)$ vectors $y$ and $\varepsilon$ are stacked by individuals.

The projection matrix $M_{[D]}$ projects onto the space orthogonal to that spanned by the columns of the individual-dummies matrix $D$, and any conformable vector post-multiplying it gets transformed into group-mean deviations. It is not hard to see that $M_{[D]}$ is a block-diagonal matrix, with blocks all equal to
\[
M_{[1_T]} = I_T - \frac{1_T 1_T'}{T}.
\]
So,
\[
M_{[D]} = \begin{bmatrix}
M_{[1_T]} & 0 & \cdots & 0\\
0 & M_{[1_T]} & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & M_{[1_T]}
\end{bmatrix}. \tag{8.6.2}
\]
The LSDV estimator for the coefficients on the $X$ regressors, $\beta$, is given by
\[
b_{LSDV} = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y.
\]
Strict exogeneity is maintained throughout:

SE: $E(\varepsilon|X) = 0$.

The following random sampling assumption is invoked for the asymptotic normality of $b_{LSDV}$ and the consistency of the $b_{LSDV}$ asymptotic covariance estimator:

RS: There is a sample of size $n = NT$ such that the elements of the sequence $\{(y_i\ X_i),\ i = 1,\dots,N\}$ are independent (NB not necessarily identically distributed) random matrices.

8.6.2. Large-sample properties of LSDV. Let $\mathrm{Var}(\varepsilon|X) = \Sigma$, where $\Sigma$ is an arbitrary and unknown p.d. matrix.

8.6.3. Consistency. The following assumptions hold.

LSDV.1: $\operatorname{plim}_{N\to\infty}\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) = \lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) \equiv Q_\Sigma$, a positive definite and finite matrix.

LSDV.2: $\operatorname{plim}_{N\to\infty}\left(\frac{X'M_{[D]}X}{N}\right) = Q$, a positive definite and finite matrix.

Exercise 64. (This has been done in class) Prove that under LSDV.1 and LSDV.2, $\operatorname{plim}_{N\to\infty} b_{LSDV} = \beta$.
8.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS
and the following
LSDV.3: $\mathrm{Var}(\varepsilon|X) = \Sigma$, where
\[
\Sigma = \begin{bmatrix}
\Sigma_1 & 0 & \cdots & 0\\
0 & \Sigma_2 & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & \Sigma_N
\end{bmatrix}
\]
is a block-diagonal $(NT\times NT)$ positive definite matrix. Notice that the blocks of $\Sigma$ are arbitrary and heterogeneous, so that both arbitrary correlation across the time observations of the same individual (referred to as within-group serial correlation) and heteroskedasticity across individuals and over time are permitted. What is not permitted by the block-diagonal structure is correlation of the $\varepsilon$ realizations across different individuals.
Now focus on the generic individual $i = 1,\dots,N$ and notice that, given the block-diagonal form of $M_{[D]}$ in (8.6.2),
\[
M_{[D]}X = \begin{bmatrix}
M_{[1_T]} & 0 & \cdots & 0\\
0 & M_{[1_T]} & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & M_{[1_T]}
\end{bmatrix}\begin{pmatrix}X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix} = \begin{pmatrix}M_{[1_T]}X_1\\ \vdots\\ M_{[1_T]}X_i\\ \vdots\\ M_{[1_T]}X_N\end{pmatrix}.
\]
The proof of asymptotic normality for $b_{LSDV}$ parallels that in Section 7.2.2, with the only difference that the random objects we now start from are not $k\times 1$ vectors at the observation level but $k\times 1$ vectors at the individual level, $X_i'M_{[1_T]}\varepsilon_i$, $i = 1,\dots,N$.
First, by strict exogeneity $E\left(X_i'M_{[1_T]}\varepsilon_i\right) = 0$ and hence
\[
\mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\mid X\right) = E\left(X_i'M_{[1_T]}\varepsilon_i\varepsilon_i'M_{[1_T]}X_i\mid X\right) = X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i,
\]
so that
\[
\mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = E\left(X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i\right).
\]
Then, averaging across individuals,
\[
\frac{1}{N}\sum_{i=1}^N \mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = \frac{1}{N}\sum_{i=1}^N E\left(X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i\right) = E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right).
\]
Therefore,
\[
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N \mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = \lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) \equiv Q_\Sigma,
\]
which is a finite matrix by assumption LSDV.1, so that the Lindeberg-Feller theorem applies to yield
\[
\frac{\sqrt N}{N}\sum_{i=1}^N X_i'M_{[1_T]}\varepsilon_i \equiv \frac{X'M_{[D]}\varepsilon}{\sqrt N} \to_d N(0,\ Q_\Sigma).
\]
Finally, since
\[
\sqrt N\,(b_{LSDV}-\beta) \equiv \left(\frac{X'M_{[D]}X}{N}\right)^{-1}\frac{X'M_{[D]}\varepsilon}{\sqrt N} \to_d Q^{-1}\frac{X'M_{[D]}\varepsilon}{\sqrt N},
\]
\[
\sqrt N\,(b_{LSDV}-\beta) \to_d N\left(0,\ Q^{-1}Q_\Sigma Q^{-1}\right)
\]
and the asymptotic covariance matrix of $b_{LSDV}$ is given by
\[
\mathrm{Avar}(b_{LSDV}) = \frac{1}{N}Q^{-1}Q_\Sigma Q^{-1}. \tag{8.6.3}
\]
8.7. A Robust covariance estimator
Arellano (1987) demonstrates that, given the $(T\times 1)$ LSDV residual vector
\[
e_{LSDV,i} = M_{[1_T]}y_i - M_{[1_T]}X_i b_{LSDV},
\]
$i = 1,\dots,N$, a consistent estimator for the asymptotic covariance matrix of $b_{LSDV}$ in equation (8.6.3) is given by the White estimator:
\[
\widehat{\mathrm{Avar}}(b_{LSDV}) = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}\hat\Sigma M_{[D]}X\left(X'M_{[D]}X\right)^{-1}, \tag{8.7.1}
\]
where $\hat\Sigma$ is a block-diagonal matrix with generic block given by $e_{LSDV,i}e_{LSDV,i}'$. More formally,
\[
\hat\Sigma = e_{LSDV}e_{LSDV}' \ast DD'.
\]
Remark 65. The estimator in (8.7.1) is robust to arbitrary heteroskedasticity and within-group serial correlation. Stock and Watson (2008) prove that in the LSDV model the White estimator correcting for heteroskedasticity only, where $\hat\Sigma$ is a diagonal matrix with generic element $e^2_{LSDV,it}$ (see the first formula of Section 9.6.1 in Greene (2008)), is inconsistent for $N\to\infty$. The crux of Stock and Watson's argument is essentially algebraic, in that demeaned residuals are correlated over time by construction and this correlation does not vanish as $N\to\infty$. The recommendation for practitioners is then to correct for both heteroskedasticity and within-group serial correlation using the estimator (8.7.1), which is not affected by the Stock and Watson critique.
Remark 66. In Stata the robust covariance matrix of LSDV is computed easily by using
the xtreg command with the options fe and vce(cluster id), where id is the name of the
individual categorical variable in your Stata data set.
A similar correction can be carried out for POLS and FGLS-RE. For POLS we have
\[
\widehat{\mathrm{Avar}}\left(b^*_{POLS}\right) = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\hat\Sigma X^*\left(X^{*\prime}X^*\right)^{-1},
\]
where
\[
\hat\Sigma = e_{POLS}e_{POLS}' \ast DD'
\]
and the POLS residual vector is defined as in equation (8.2.10), whereas for FGLS-RE we have
\[
\widehat{\mathrm{Avar}}\left(b^*_{FGLS-RE}\right) = \left(X^{*\prime}\hat\Omega^{-1}X^*\right)^{-1}X^{*\prime}\hat\Omega^{-1/2}\hat\Sigma\,\hat\Omega^{-1/2}X^*\left(X^{*\prime}\hat\Omega^{-1}X^*\right)^{-1},
\]
where
\[
\hat\Sigma = e_{FGLS-RE}e_{FGLS-RE}' \ast DD'
\]
and the FGLS-RE residual vector is defined as
\[
e_{FGLS-RE} = y - X^*b^*_{FGLS-RE}.
\]
Remark 67. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE are estimated by using, respectively, the regress and the xtreg, re commands, both with the option vce(cluster id), as in the LSDV case.
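For concreteness, the three robust variants side by side (hypothetical variable names):

xtreg   y x1 x2, fe vce(cluster id)    // LSDV with the Arellano estimator (Remark 66)
xtreg   y x1 x2, re vce(cluster id)    // FGLS-RE, cluster-robust
regress y x1 x2, vce(cluster id)       // pooled OLS, cluster-robust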
8.8. Unbalanced panels
All of the methods so far have been described with a balanced panel data set in mind, but
nothing prevents applying the same methods to unbalanced panels (different numbers of time
observations across individuals).
Unbalanced panels only require a slight change in notation. As always we index individuals
by i = 1, ..., N , but now the size of each individual cluster, or group, of observations varies
across individuals and so the time index is t = 1, ..., Ti. This implies the following three facts.
(1) As in balanced panels, each observation in the data is uniquely identified by the two indexes: the pair $(i,t)$ identifies the $t$-th observation of the $i$-th individual.
(2) Differently from balanced panels, the group size, $T_i$, is no longer constant across clusters.
(3) Differently from balanced panels, the sample size is $n = \sum_{i=1}^N T_i$.
The LSDV estimator is implemented without any problem either by creating individual dummies or by taking variables in group-mean deviations, where the group means are at the individual level. The random effect estimator requires only some algebraic modifications in the formulas to allow for unbalancedness. The Arellano estimator also requires simple modifications in notation to accommodate unbalancedness: there is now a $(T_i\times 1)$ LSDV residual vector given by
\[
e_{LSDV,i} = M_{[1_{T_i}]}y_i - M_{[1_{T_i}]}X_i b_{LSDV},
\]
$i = 1,\dots,N$, and so the matrix $\hat\Sigma$ in (8.7.1) is a block-diagonal matrix with blocks that are now of different sizes. The notation using the Hadamard product does not, instead, require any adjustment for unbalancedness.
CHAPTER 9
Robust inference with cluster samplings
9.1. Introduction
The panel-data sets considered in these notes, with a large individual dimension and a small time dimension, are an example of one-way clustering. If the data-set is balanced, there are $n = NT$ observations clustered into $N$ individual groups, each comprising $T$ observations. If the data-set is unbalanced, as is often the case with real-world panels, there are $n = \sum_{i=1}^N T_i$ observations clustered into $N$ individual groups, each comprising $T_i$ observations, $i = 1,\dots,N$.
One-way clusterings can be observed also in cross-section data. Think for example of a
large sample of students clustered into many schools. The data structure parallels exactly
that of an unbalanced panel, just index schools by i = 1, ..., N and students within schools
by t = 1, ..., Ti. So, any observation in the data is uniquely identified by the values of i and
t. In other words, observation (i, t) refers to the t.th student in the i.th school. Therefore,
under random sampling of schools and arbitrary sampling of students within schools all of
the statistical methods described in chapter 8 can be conveniently used. This means that
pooled OLS, fixed and random effect estimators can be applied to clustered cross-sections.
The F-test on individual effects can be used to gauge the presence of latent heterogeneity.
The robust Hausman test can be used to discriminate between fixed and random effects and,
importantly, the White-Arellano estimator described in Section 8.7 can be used for computing
robust standard errors. For more on the parallel between unbalanced panel data and one-way
clustered cross sections see chapter 20 in Wooldridge (2010).
Dealing with one-way clustering is an important advance in econometrics. It is often the
case, however, that real-world economic data have a multi-dimensional structure, so clustering
can occur along more than one dimension. In a student survey, for example, there could be an
additional level of clustering given by teachers, or classes, within schools. Similarly, patients
can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals.
In a cross-sectional data-set of bilateral trade-flows, the cross-sectional units are the pairs of
countries and these are naturally clustered along two dimensions: the first and the second
country in the pair (Cameron et al., 2011). ln matched-employers-employees data there is the
worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).
Is it possible to do inference that is robust to multi-way clustering as we do inference that is
robust to one-way clustering? A recent paper by Cameron et al. (2011) offers a computationally
simple solution extending the White estimator to multi-way contexts. In essence, their method
boils down to computing a number of one-way robust covariance estimators, that are then
combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial
for the accuracy of the multi-way estimator that the one-way estimators be also accurate,
and so that the data-set has dimensions with a large number of clusters. Such an asymptotic requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both individual- and time-clustering in the typical micro-econometric panel data set, where T is fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator has good finite-sample properties in data-sets with dimensions of 100 clusters.

To illustrate the method I focus on two-way clustering, using a notation that is close to that in Cameron et al. (2011).
9.2. Two-way clustering
Notation is general enough to embrace cases in which cluster affiliations are not sufficient to uniquely identify an observation. There is a data-set with $n$ observations indexed by $i \in \{1,\dots,n\}$. Observations are clustered along two dimensions, $g \in \{1,\dots,G\}$ and $h \in \{1,\dots,H\}$. Asymptotics is for both $G$ and $H \to\infty$. The data-sets that I have in mind are, for example,
• surveys of students with at least moderately large numbers of teachers and schools;
• surveys of patients with at least moderately large numbers of doctors and hospitals;
• bilateral trade-flow data with at least a moderately large number of countries;
• matched employer-employee data with at least moderately large numbers of firms and workers.
For each dimension, it is known to which cluster a given observation $i = 1,\dots,n$ belongs. This information is contained in the mappings $g: \{1,\dots,n\} \to \{1,\dots,G\}$,
\[
g(i) = \left[g \in \{1,\dots,G\}:\ \text{observation } i \text{ belongs to cluster } g\right],\quad i = 1,\dots,n,
\]
and $h: \{1,\dots,n\} \to \{1,\dots,H\}$,
\[
h(i) = \left[h \in \{1,\dots,H\}:\ \text{observation } i \text{ belongs to cluster } h\right],\quad i = 1,\dots,n.
\]
From the mappings $g$ and $h$ we can also construct the $n\times G$ dummy-variable matrix $D^G$ and the $n\times H$ dummy-variable matrix $D^H$, as the following definition indicates.

Definition 68. Let
\[
d_{ig} = \begin{cases}1 & \text{if } g(i) = g\\ 0 & \text{else,}\end{cases}
\]
$i \in \{1,\dots,n\}$, $g \in \{1,\dots,G\}$, and
\[
d_{ih} = \begin{cases}1 & \text{if } h(i) = h\\ 0 & \text{else,}\end{cases}
\]
$i \in \{1,\dots,n\}$, $h \in \{1,\dots,H\}$. Then, $D^G$ and $D^H$ are the $n\times G$ and $n\times H$ matrices with $(i,g)$ element $d_{ig}$ and $(i,h)$ element $d_{ih}$, respectively.

Given $g$ and $h$, we can define an intersection dimension, say $G\cap H$, such that each cluster in $G\cap H$ contains only observations that belong to one unique cluster in $\{1,\dots,G\}$ and one unique cluster in $\{1,\dots,H\}$. This yields the matrix of dummy variables $D^{G\cap H}$. By construction, the number of clusters in the $G\cap H$ dimension is at most $G\times H$. For example, if
\[
D^G = \begin{pmatrix}
1 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{pmatrix},\qquad
D^H = \begin{pmatrix}
1 & 0\\
1 & 0\\
1 & 0\\
0 & 1\\
0 & 1\\
1 & 0
\end{pmatrix},
\]
then
\[
D^{G\cap H} = \begin{pmatrix}
1 & 0 & 0 & 0\\
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}.
\]
This framework allows that, in a survey of patients, for example, there could be more than one patient admitted to the same hospital and under the assistance of the same doctor. Or, similarly, that in a panel data set matching workers with firms the same worker can move across firms over time or that, conversely, the same firm may employ different workers over time.
Then, define three $n \times n$ indicator matrices: $S_G = D_G D_G'$, $S_H = D_H D_H'$ and $S_{G \cap H} = D_{G \cap H} D_{G \cap H}'$. It is easy to verify that:
• $S_G$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $g$ in $\{1, ..., G\}$; zero otherwise.
• $S_H$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $h$ in $\{1, ..., H\}$; zero otherwise.
• $S_{G \cap H}$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $g$ in $\{1, ..., G\}$ and any cluster $h$ in $\{1, ..., H\}$; zero otherwise.
Also, the $ii$th entries in $S_G$, $S_H$ and $S_{G \cap H}$ equal one for all $i = 1, ..., n$, so the three indicator matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering,
$$y_i = \mathbf{x}_i'\beta + \varepsilon_i,$$
$i = 1, ..., n$, and let
$$\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Assumptions LRM.1-LRM.3 hold. Assumption LRM.4 is here replaced with a more general one permitting arbitrary heteroskedasticity and maintaining zero correlation only between errors peculiar to observations that share no cluster in common. For example, the latent error of patient $i$ is not correlated with the latent error of patient $j$ only if the two subjects are under the assistance of different doctors, say $g(i) \neq g(j)$, and in different hospitals, $h(i) \neq h(j)$. Formally,
LRM.4b: $Var(\varepsilon|X) \equiv \Sigma = E(\varepsilon\varepsilon'|X)$ with $E(\varepsilon_i\varepsilon_j|X) = 0$ unless $g(i) = g(j)$ or $h(i) = h(j)$, $i, j = 1, ..., n$.
Importantly, LRM.4b can equivalently be expressed as
(9.2.1) $\Sigma = E(\varepsilon\varepsilon' \ast S_G|X) + E(\varepsilon\varepsilon' \ast S_H|X) - E(\varepsilon\varepsilon' \ast S_{G \cap H}|X)$,
where the symbol $\ast$ stands for the element-by-element matrix product (also known as the Hadamard product) between matrices of equal dimension (verify the equivalence of LRM.4b and (9.2.1)).
As we know, OLS is in this case consistent and unbiased but not efficient. More importantly, OLS standard errors are biased, and so we need a two-way robust covariance estimator for inference. The covariance estimator devised by Cameron et al. (2011) is the combination of three one-way covariance estimators a la White. It is constructed along the following steps.
Carry out OLS and obtain the OLS residuals
$$e_{i,g(i),h(i)} = y_{i,g(i),h(i)} - \mathbf{x}_{i,g(i),h(i)}'b,$$
$i = 1, ..., n$, and stack them into the $n \times 1$ column vector
$$e = \begin{pmatrix} e_{1,g(1),h(1)} \\ \vdots \\ e_{i,g(i),h(i)} \\ \vdots \\ e_{n,g(n),h(n)} \end{pmatrix}.$$
The first one-way covariance estimator is
$$\widehat{Avar}(b)_G = (X'X)^{-1} X'\hat{\Sigma}_G X (X'X)^{-1},$$
where $\hat{\Sigma}_G = ee' \ast S_G$. $\widehat{Avar}(b)_G$ is a White estimator that is robust to clustering only along the $G$ dimension.
The second one-way covariance estimator is
$$\widehat{Avar}(b)_H = (X'X)^{-1} X'\hat{\Sigma}_H X (X'X)^{-1},$$
where $\hat{\Sigma}_H = ee' \ast S_H$. $\widehat{Avar}(b)_H$ is a White estimator that is robust to clustering only along the $H$ dimension.
The third one-way covariance estimator is
$$\widehat{Avar}(b)_{G \cap H} = (X'X)^{-1} X'\hat{\Sigma}_{G \cap H} X (X'X)^{-1},$$
where $\hat{\Sigma}_{G \cap H} = ee' \ast S_{G \cap H}$. $\widehat{Avar}(b)_{G \cap H}$ is a White estimator that is robust to clustering only along the $G \cap H$ dimension.
Finally, the two-way robust covariance estimator is
(9.2.2) $\widehat{Avar}(b) = \widehat{Avar}(b)_G + \widehat{Avar}(b)_H - \widehat{Avar}(b)_{G \cap H}$.
$\widehat{Avar}(b)$ is robust to clustering along both the $G$ and $H$ dimensions and is the estimator that is used to construct our robust tests.
Remark 69. Writing $\widehat{Avar}(b)$ as
$$\widehat{Avar}(b) = (X'X)^{-1} X'\left(\hat{\Sigma}_G + \hat{\Sigma}_H - \hat{\Sigma}_{G \cap H}\right) X (X'X)^{-1}$$
and then considering equation (9.2.1) uncovers the analogy principle on which the two-way robust covariance estimator rests.
Remark 70. Cameron et al. (2011) also present a general multi-way version of $\widehat{Avar}(b)$, which is derived from a simple extension of the foregoing analysis. The additional cost is only in terms of a more cumbersome notation. For the formulas I refer you to that paper.
9.3. Stata implementation
While there is no official command for the two-way $\widehat{Avar}(b)$ in Stata, it can be simply implemented by means of three one-way OLS regressions. Suppose that in our data set the two categorical variables for dimensions $G$ and $H$ are called doctor and hospital. You can assemble $\widehat{Avar}(b)$ along the following steps.
(1) Create the categorical variable for the intersection dimension, $G \cap H$, through the following instruction
egen doc_hosp = group(doctor hospital)
where doc_hosp is a name of choice.
(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix V_robust=V_d+V_h-V_dh (V_robust is a name of choice). To see the content of V_robust do: matrix list V_robust. The robust standard errors are just the square roots of the main diagonal elements in V_robust. A complete do-file sketch putting these steps together is given below.
1 It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).
CHAPTER 10
Issues in linear IV and GMM estimation
10.1. Introduction
The conditional mean independence of $\varepsilon$ and $\mathbf{x}$ maintained by P.2 (Section 2.1) often fails in economic structures, where some of the $\mathbf{x}$ variables are chosen by the economic subjects and as such may depend on the latent factors at the equilibrium. These $\mathbf{x}$ variables are said to be endogenous.
In economics, think of a production function, where (some of) the observable input quantities are under the firm's control. The same consideration holds for the education variable in a wage equation. These are all cases of omitted variable bias (Section 4.7), which makes standard estimation techniques unusable.
As we have seen in Section 4.7.1, the proxy variables solution maintains that there is information external to the model that is able to fully explain the correlation between observed and unobserved variables. For example, observed IQ scores, clearly redundant in a wage equation with latent ability, are an imperfect measure of latent ability, but the discrepancy between the two variables is likely to be unrelated to individual education levels. Such information, so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individuals and/or over time and a panel data set is available, the endogeneity problem is solved by applying the panel-data methods introduced in Chapter 8. But panel data are not always available, and even when they are, the disturbing omitted factors may not meet the time-constancy requirement. For example, idiosyncratic productivity shocks may well be related to input factors in the estimation of a production function.
Neither proxy variables nor panel data methods are generally usable when endogeneity springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise variable as a determinant of individual health, hinting at an endogeneity bias due to reverse causality. If the exercise activity is indeed affected by the health status, exercise would depend on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized Method of Moments (GMM) estimators offer a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity problem in two stages. The first stage attempts to identify the exogenous-variation components of the $\mathbf{x}$'s through a set of exogenous variables, some of which are external to the model, called instrumental variables. The second stage applies regression analysis using only the first-stage exogenous components as explanatory variables. IV and GMM methods are preferred tools of econometric analysis, compared to alternative techniques, since often the first stage can be justified on the ground of economic theory.
There are various IV and GMM applications demonstrating the methods of this chapter: iv.do using mus06data.dta, IV_GMM_panel.do using costfn.dta, and IV_GMM_DPD.do and abest.do, both using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.
10.2. The method of moments
The method of moments estimates the parameters of interest by replacing population moment conditions with their sample analogs. Almost all popular estimators can be thought of as method of moments estimators. Two examples follow.
10.2.1. The linear regression model. Consider the linear model of Chapter 1 and the system of moment conditions (1.2.3)
$$E(\mathbf{x}y) = E(\mathbf{x}\mathbf{x}')\beta.$$
So, the true coefficient vector, $\beta$, solves the population moment conditions and is equal to $\beta = E(\mathbf{x}\mathbf{x}')^{-1}E(\mathbf{x}y)$. By the analogy principle a consistent estimator for $\beta$, $\hat{\beta}^{*}$, will satisfy the system of $k$ analog sample moment conditions:
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) = 0.$$
Hence,
$$\hat{\beta}^{*} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n}\mathbf{x}_i y_i = (X'X)^{-1}X'y,$$
which is exactly the OLS estimator.
10.2.2. The Instrumental Variable (IV) regression model in the just-identified case. Consider the linear model of Chapter 1 but without assumption P.3, $E(\varepsilon|\mathbf{x}) = 0$, or even the weaker P.3b, $E(\mathbf{x}\varepsilon) = 0$. This means that some of the variables in $\mathbf{x}$ are potentially endogenous, that is, related in some way to $\varepsilon$. Assume, instead, conditional mean independence for an $L \times 1$ vector of variables $\mathbf{z}$, that is $E(\varepsilon|\mathbf{z}) = 0$, with $L = k$. The vector $\mathbf{z}$ is generally different from $\mathbf{x}$; if it is not, then we are back to the classical regression model and there is no endogeneity problem. Then, as before, using the law of iterated expectations we have
$$E(\mathbf{z}\varepsilon) = E_{\mathbf{z}}\left[E(\mathbf{z}\varepsilon|\mathbf{z})\right] = E_{\mathbf{z}}\left[\mathbf{z}E(\varepsilon|\mathbf{z})\right] = 0.$$
So, there are $k$ moment conditions in the population
$$E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right] = 0,$$
or equivalently
$$E(\mathbf{z}y) = E(\mathbf{z}\mathbf{x}')\beta.$$
So, the true coefficient vector, $\beta$, solves the population moment conditions and is equal to $\beta = E(\mathbf{z}\mathbf{x}')^{-1}E(\mathbf{z}y)$. By the analogy principle a consistent estimator for $\beta$, $\hat{\beta}^{**}$, will satisfy the system of $k$ analog sample moment conditions:
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{z}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) = 0.$$
Hence,
$$\hat{\beta}^{**} = \left(\sum_{i=1}^{n}\mathbf{z}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n}\mathbf{z}_i y_i = (Z'X)^{-1}Z'y,$$
which is the classical IV estimator.
The intuition is straightforward: since the true coefficients solve the population moment conditions, if the sample moments provide good estimates for the population moments, then one might expect that the estimator solving the sample moment conditions will provide good estimates of the true coefficients.
What if there are more moment conditions than unknown parameters, that is, if $L > k$? Then we turn to GMM estimation.
10.2.3. The Generalized Method of Moments. GMM estimation is general: it can be applied to both linear and non-linear models and in the over-identified case $L > k$. To see this, define the column vector of observables in the population $\mathbf{w} \equiv (y\ \mathbf{x}'\ \mathbf{z}')'$. There are $L > k$ population moments collected into the $(L \times 1)$ column vector $m(\beta)$:
$$m(\beta) \equiv E\left[f(\mathbf{w},\beta)\right],$$
and suppose that the following population moment conditions hold
$$m(\beta) = 0.$$
Now consider the $L$ sample moments
$$m(\hat{\beta}) \equiv \frac{1}{n}\sum_{i=1}^{n} f(\mathbf{w}_i,\hat{\beta})$$
and the $L$ sample moment conditions
$$m(\hat{\beta}) = 0.$$
Hence there are $L$ equations and $k$ unknowns, so that no estimator $\hat{\beta}$ can solve the system of sample moment conditions exactly. Instead, there exists a $\hat{\beta}$ that can make $m(\hat{\beta})$ as close to zero as possible:
$$\hat{\beta}_{GMM} = \arg\min_{\hat{\beta}} Q(\hat{\beta}),$$
where $Q(\hat{\beta}) \equiv m(\hat{\beta})'A\,m(\hat{\beta})$ is a quadratic criterion function of the sample moments and $A$ is a positive definite matrix weighting the squares and the cross-products of the sample moments in $Q(\hat{\beta})$.
Note that $Q(\hat{\beta}) \geq 0$ and, since $A$ is positive definite, $Q(\hat{\beta}) = 0$ only if $m(\hat{\beta}) = 0$. Thus, $Q(\hat{\beta})$ can be made exactly zero in the just-identified case and is strictly greater than zero in the over-identified case.
10.2.4. The TSLS estimator. It is not hard to prove that the well-known Two Stage Least Squares (TSLS) estimator in the over-identified linear model belongs to the class of GMM estimators. Consider the linear regression model of Section 10.2.2 with $L > k$ instruments. Then, there are the following population moments
$$m(\beta) \equiv E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right]$$
and population moment conditions
$$E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right] = 0.$$
The $L$ sample moments are collected into the $(L \times 1)$ vector $m(\hat{\beta})$,
$$m(\hat{\beta}) \equiv \frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) \equiv \frac{1}{n}Z'\left(y - X\hat{\beta}\right).$$
Suppose we choose a quadratic criterion function with the following weighting matrix
$$A \equiv \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_i\mathbf{z}_i'\right)^{-1} = n\left(Z'Z\right)^{-1}.$$
Then
$$Q(\hat{\beta}) \equiv \frac{1}{n}\left(y - X\hat{\beta}\right)'Z\left(Z'Z\right)^{-1}Z'\left(y - X\hat{\beta}\right),$$
with the following normal equations for the minimization problem:
$$\frac{\partial Q(\hat{\beta})}{\partial\hat{\beta}} \equiv -\frac{2}{n}X'Z\left(Z'Z\right)^{-1}Z'\left(y - X\hat{\beta}\right) = 0,$$
which, solved for $\hat{\beta}$, yield the TSLS estimator
$$\hat{\beta}_{TSLS} \equiv \left(X'Z\left(Z'Z\right)^{-1}Z'X\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y,$$
or, more compactly,
$$\hat{\beta}_{TSLS} \equiv \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}y.$$
The estimator's name derives from the fact that it is computed in two stages:
(1) Regress each column of $X$ on $Z$ using OLS to obtain the OLS fitted values of $X$: $Z(Z'Z)^{-1}Z'X = P_{[Z]}X$. Thus, $X = P_{[Z]}X + M_{[Z]}X$, where $P_{[Z]}X$ is an approximately exogenous component, whose covariance with $\varepsilon$ goes to zero as $n \to \infty$, and $M_{[Z]}X$ is a residual, potentially endogenous, component. Only $P_{[Z]}X$ is used in the second stage.
(2) Regress $y$ on the fitted values, $P_{[Z]}X$, to obtain TSLS.
If the population moment conditions are true, then $Q(\hat{\beta}_{TSLS})$ should not be significantly different from zero. This provides a test for the validity of the $L - k$ over-identifying moment conditions based on the following statistic (Hansen-Sargan test)
$$S \equiv nQ(\hat{\beta}_{TSLS}) \sim \chi^2(L-k).$$
Exercise 71. Prove that if $L = k$, TSLS collapses to IV.
Solution: $Z'X$ is invertible, so
$$\hat{\beta}_{TSLS} \equiv \left(X'Z\left(Z'Z\right)^{-1}Z'X\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y = \left(Z'X\right)^{-1}Z'Z\left(X'Z\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y = \left(Z'X\right)^{-1}Z'y.$$
10.3. Stata implementation of the TSLS estimator
It is implemented by the command ivregress 2sls followed by the names of the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side endogenous variables and the excluded exogenous variables (the instruments), as follows
ivregress 2sls depvar indepvars (endog_vars = instruments), options
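For instance, a minimal sketch in the spirit of iv.do; the variable names (outcome y, exogenous x1 x2, endogenous d, instruments z1 z2) are hypothetical placeholders rather than the actual variables of mus06data.dta:

ivregress 2sls y x1 x2 (d = z1 z2), vce(robust)
estat firststage     // first-stage and weak-instrument diagnostics (Section 10.8)
estat endogenous     // Durbin-Wu-Hausman exogeneity test (Section 10.6)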
10.4. Stata implementation of the (linear) GMM estimator
It is implemented by ivregress gmm followed by the names of the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side endogenous variables and the excluded exogenous variables (the instruments), as follows
ivregress gmm depvar indepvars (endog_vars = instruments), options
10.4.1. Choosing the weighting matrix. The weighting matrix in the optimal two-step GMM estimator is
(10.4.1) $A = \left(Z'\hat{\Sigma}Z/n\right)^{-1}$
(see Hansen 1982). $A$ is a consistent estimate of the inverse of $Var\left[\mathbf{z}_i\left(y_i - \mathbf{x}_i'\beta\right)\right]$.
Choice of $\hat{\Sigma}$:
• If $\varepsilon$ is homoskedastic and independent, then $\hat{\Sigma} = I$ (the resulting GMM estimator collapses to TSLS). It's implemented through the ivregress gmm option wmatrix(unadjusted).
• If $\varepsilon$ is heteroskedastic and independent, then $\hat{\Sigma}$ is a diagonal matrix with generic diagonal element equal to the squared residual from some one-step consistent estimator, TSLS for example:
$$\hat{\Sigma} = \begin{pmatrix} e_1^2 & 0 & \cdots & 0 \\ 0 & e_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & e_n^2 \end{pmatrix},$$
with $e_i = y_i - \mathbf{x}_i'\hat{\beta}_{TSLS}$, $i = 1, ..., n$. It's implemented through the ivregress gmm option wmatrix(robust). It's the default.
• If errors are clustered, with $N$ clusters, then $\hat{\Sigma}$ is a block-diagonal matrix with generic block equal to the outer product of the residuals peculiar to the corresponding cluster. Again, residuals are taken from a one-step consistent regression (TSLS):
$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_1 & 0 & \cdots & 0 \\ 0 & \hat{\Sigma}_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \hat{\Sigma}_N \end{pmatrix},$$
with $\hat{\Sigma}_i = e_ie_i'$, $i = 1, ..., N$. Notice that in this case $e_i = y_i - \mathbf{x}_i'\hat{\beta}_{TSLS}$ is a vector and not a scalar: it is the vector of residual observations peculiar to cluster $i = 1, ..., N$. It's implemented through the ivregress gmm option wmatrix(cluster cluster_var). (A short sketch of these options follows the list.)
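For instance (same hypothetical variable names as in the TSLS sketch above, with cvar a hypothetical cluster variable), the three choices map into:

ivregress gmm y x1 x2 (d = z1 z2), wmatrix(unadjusted)    // homoskedastic, independent errors
ivregress gmm y x1 x2 (d = z1 z2), wmatrix(robust)        // heteroskedastic errors (the default)
ivregress gmm y x1 x2 (d = z1 z2), wmatrix(cluster cvar)  // errors clustered on cvar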
10.4.2. Iterative GMM. The GMM procedure can be iterated by adding the option
igmm. The resulting estimator is asymptotically equivalent to the two-step estimator. However,
Hall (2005) suggests that it may have a better finite-sample performance.
10.5. Robust Variance Estimators
The less efficient, but computationally simpler and still consistent, TSLS estimator is often used in estimation. Its robust variance-covariance matrix $Var(\hat{\beta}_{TSLS})$ is consistently estimated as
$$\widehat{Var}\left(\hat{\beta}_{TSLS}\right) = \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}\hat{\Sigma}P_{[Z]}X\left(X'P_{[Z]}X\right)^{-1},$$
where $\hat{\Sigma}$ is chosen according to the various departures from homoskedasticity and independence spelled out above. The Stata implementation of the three variance-covariance estimators is through the ivregress options vce(unadjusted), vce(robust) and vce(cluster cluster_var).
10.6. Durbin-Wu-Hausman Exogeneity test
A conventional Hausman test can always be implemented, based on the Hausman statistic measuring the statistical difference between IV and OLS estimates. It is not robust to heteroskedastic and clustered errors, though. Wu suggests an alternative. But before presenting it, consider this exercise, which will prove useful in the derivations below.
Exercise 72. Prove that the TSLS estimator for $\beta_2$ is
$$b_{2,TSLS} = \left(X_2'P_{[M_{[X_1]}Z_1]}X_2\right)^{-1}X_2'P_{[M_{[X_1]}Z_1]}y.$$
Solution. Applying the FWL Theorem to the second-stage regression,
$$b_{2,TSLS} = \left(X_2'P_{[Z]}M_{[P_{[Z]}X_1]}P_{[Z]}X_2\right)^{-1}X_2'P_{[Z]}M_{[P_{[Z]}X_1]}P_{[Z]}y.$$
By Lemma 12, $P_{[Z]} = P_{[X_1]} + P_{[M_{[X_1]}Z_1]}$, so that $P_{[Z]}X_1 = X_1$ and
$$b_{2,TSLS} = \left(X_2'P_{[Z]}M_{[X_1]}P_{[Z]}X_2\right)^{-1}X_2'P_{[Z]}M_{[X_1]}P_{[Z]}y.$$
But then $P_{[Z]} = P_{[X_1]} + P_{[M_{[X_1]}Z_1]}$ also assures that $P_{[Z]}M_{[X_1]} = P_{[M_{[X_1]}Z_1]}$, proving the result.
The DWH test provides a robust version of the Hausman test. It maintains instrument validity, $E(\varepsilon|Z) = 0$, and is based on the so-called control-function approach, which recasts the endogeneity problem as a misspecification problem affecting the structural equation
(10.6.1) $y = X\beta + \varepsilon$,
$X = (X_1\ X_2)$, $\beta = (\beta_1'\ \beta_2')'$, $Z = (X_1\ Z_1)$ and $\varepsilon = u + \nu\pi$. The component $u$ is such that $E(u|X) = 0$ and $\nu$ is the $n \times k_2$ matrix of the errors in the first-stage equations of the variables $X_2$. As such, $\nu$ is responsible for the endogeneity of $X_2$.
Replacing $\nu$ in (10.6.1) with the residuals from the first-stage regressions, $\nu = M_{[Z]}X_2$, makes the DWH test operational as a simple test of joint significance for $\pi$ in the auxiliary OLS regression
(10.6.2) $y = X\beta + M_{[Z]}X_2\pi + u^{*}$.
The test works well since, under the alternative of $\pi \neq 0$, OLS estimation of the auxiliary regression yields the TSLS estimators. This is proved as follows:
$$y = \left(P_{[Z]} + M_{[Z]}\right)X\beta + M_{[Z]}X_2\pi + u^{*}$$
and so
$$y = P_{[Z]}X\beta + M_{[Z]}X\beta + M_{[Z]}X_2\pi + u^{*},$$
but $M_{[Z]}X\beta = M_{[Z]}X_2\beta_2$ since by Lemma 12 $M_{[Z]} = M_{[X_1]} - P_{[M_{[X_1]}Z_1]}$. Therefore,
$$y = P_{[Z]}X\beta + M_{[Z]}X_2\left(\beta_2 + \pi\right) + u^{*},$$
and since $P_{[Z]}X$ and $M_{[Z]}X_2$ are orthogonal the FWL Theorem assures that the OLS estimator for $\beta$ is
$$b_{TSLS} = \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}y,$$
and also
$$\widehat{\beta_2 + \pi} = \left(X_2'M_{[Z]}X_2\right)^{-1}X_2'M_{[Z]}y = \left(X_2'M_{[Z]}X_2\right)^{-1}\left(X_2'M_{[X_1]}y - X_2'P_{[M_{[X_1]}Z_1]}y\right) = K b_{2,OLS} + (I - K)b_{2,TSLS},$$
with $K \equiv \left(X_2'M_{[Z]}X_2\right)^{-1}\left(X_2'M_{[X_1]}X_2\right)$ and where the last equation follows from Exercise 72.
Therefore
$$\hat{\pi} = K b_{2,OLS} + (I - K)b_{2,TSLS} - b_{2,TSLS} = K\left(b_{2,OLS} - b_{2,TSLS}\right),$$
proving that the test indeed follows the Hausman test's general principle of assessing the distance between an efficient estimator and a consistent but inefficient estimator under the null hypothesis.
One great advantage of the DWH test over a conventional Hausman test is that it can
be easily robustified for heteroskedasticity and/or clustered errors by estimating (10.6.2) with
regress and a suitable robust option, vce(cluster clustervar) for example.
DWH can be immediately implemented in Stata through the ivregress postestimation
command estat endogenous.
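A minimal sketch of the robustified control-function version of the test (hypothetical variable names as before; cvar is a cluster variable):

quietly regress d x1 x2 z1 z2               // first-stage regression
predict v2hat, residuals                    // residuals M_[Z]X_2
regress y x1 x2 d v2hat, vce(cluster cvar)  // auxiliary regression (10.6.2)
test v2hat                                  // DWH test: joint significance of pi
* alternatively, after ivregress: estat endogenous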
10.7. Endogenous binary variables
The linear IV-GMM approach outlined so far fits the case of binary endogenous variables, producing consistent estimates. However, a first-stage regression fully accounting for the binary structure of the endogenous variables may provide considerable efficiency gains. The implied (non-linear) model is as follows:
$$y_i = \mathbf{x}_{1i}'\beta_1 + x_{2i}\beta_2 + \epsilon_i$$
$$x_{2i}^{*} = \mathbf{x}_{1i}'\pi_1 + \mathbf{z}_i'\pi_2 + \nu_i$$
$$x_{2i} = \begin{cases} 1 & \text{if } x_{2i}^{*} > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$(\epsilon_i, \nu_i) \sim N\left[0, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right].$$
It is estimated by the Stata procedure
treatreg depvar indepvars, treat(endog_var = instruments) other_options
through either ML (the default) or a consistent two-step procedure (twostep option).
10.8. Testing for weak instruments
• Staiger and Stock's rule of thumb: partial F tests in the first-stage regression > 10. It is not rigorous, tends to reject weak instruments too often and has no obvious implementation when there is more than one endogenous variable.
• Stock and Yogo's (2005) two tests overcome all of the above difficulties. They are both based on the minimum eigenvalue of the matrix analog of the partial F test, a statistic introduced by Cragg and Donald (1993) to test non-identification. Importantly, the large-sample properties of both tests have been derived under the assumption of homoskedastic and independent errors. Caution must be taken, then, when drawing conclusions from the tests if the errors are thought to depart from those hypotheses.
Both procedures are implemented by the ivregress postestimation command estat firststage.
10.9. Inference with weak instruments
Conditional inference on the coefficients of the endogenous variables in the presence of weak instruments is implemented through the command condivreg by Mikusheva and Poi (2006). The theory is reviewed and expanded in Andrews et al. (2007). The command produces three alternative confidence sets for the coefficient of the endogenous regressor, obtained from the conditional LR, Anderson-Rubin (option ar) and LM statistics (option lm). The syntax of condivreg is similar to that of ivregress.
10.9.1. Three-Stage Least Squares. It is a system estimator including structural equations for all endogenous variables. Identification is ensured by standard (sufficient) rank and (necessary) order conditions. It is seldom used, as it is inconsistent in the presence of heteroskedastic errors, which is the norm in most micro applications. The Stata command is reg3.
10.10. Dynamic panel data
Situations in which past decisions have an impact on current behaviour are ubiquitous in economics. For example, in the presence of input adjustment costs, short-run input demands depend also on past input levels. In such cases fitting a static model to the data will lead to what is referred to as dynamic underspecification. With a panel data set, however, it is possible to implement a dynamic model from the outset in order to describe the phenomena of interest.
To make things simple, let us get started with the simple autoregressive process
(10.10.1) $y_{it} = \alpha + \gamma y_{i,t-1} + \epsilon_{it}$,
$t = 1, ..., T$, $i = 1, ..., N$.
Model (10.10.1) can be easily extended to allow for time-invariant individual terms:
(10.10.2) $y_{it} = \gamma y_{i,t-1} + \alpha_i + \epsilon_{it}$,
$t = 1, ..., T$, $i = 1, ..., N$. In vector notation, stacking time observations for each individual,
$$y_i = \gamma y_{-1i} + \alpha_i 1_T + \epsilon_i,$$
$i = 1, ..., N$, where
$$\underset{(T\times 1)}{y_i} = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{it} \\ \vdots \\ y_{iT} \end{pmatrix}, \qquad \underset{(T\times 1)}{y_{-1i}} = \begin{pmatrix} y_{i,0} \\ \vdots \\ y_{i,t-1} \\ \vdots \\ y_{i,T-1} \end{pmatrix}, \qquad \underset{(T\times 1)}{\epsilon_i} = \begin{pmatrix} \epsilon_{i1} \\ \vdots \\ \epsilon_{it} \\ \vdots \\ \epsilon_{iT} \end{pmatrix}.$$
Notice that for each individual there are $T + 1$ observations available in the data set, from $y_{i0}$ to $y_{iT}$, but only $T$ are usable since one is lost to taking lags. The problem here is that $y_{-1i}$ is not strictly exogenous. Given (10.10.2), the $t$-th realization of $y_{-1i}$ is $y_{i,t-1} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{i,t-1})$ and so all future realizations of $y_{-1i}$, from $y_{i,t} = f(y_{i0}, \epsilon_{i1}, ..., \epsilon_{it})$ to $y_{i,T-1} = f(y_{i0}, \epsilon_{i1}, ..., \epsilon_{it}, ..., \epsilon_{i,T-1})$, depend on $\epsilon_{i,t}$, which makes $E(\epsilon_{i,t}|y_{i,0}, ..., y_{i,T-1}) = 0$ fail (exercise: can you work out the exact expression of the right-hand side of $y_{i,t-1} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{i,t-1})$?).
Nonetheless, we may maintain conditional mean independence between $\epsilon_{i,t}$ and the $t$-th and more remote values of $y_{-1i}$, say $y_i^{t-1} = (y_{i,t-1}, y_{i,t-2}, ..., y_{i,0})'$, using the notation in Arellano (2003). More formally, we maintain throughout (see Arellano (2003) for a discussion)
A.1: $E\left(\epsilon_{it}|y_i^{t-1}, \alpha_i\right) = 0$ for all $t = 1, ..., T$.
Assumption A.1 is also considered in Wooldridge (chapter 11, 2010), where it is referred to as sequential exogeneity conditional on the unobserved effect. It may be convenient sometimes to maintain also the following (sequential) conditional homoskedasticity assumption
A.2: $E\left(\epsilon_{it}^2|y_i^{t-1}, \alpha_i\right) = \sigma^2$ for all $t = 1, ..., T$.
It is not hard to prove that Equation (10.10.1) and Assumption A.1 imply the following (prove it using the LIE and $\epsilon_{i,t-j} = y_{i,t-j} - \gamma y_{i,t-j-1} - \alpha_i$)
A.3: $E\left(\epsilon_{it}\epsilon_{i,t-j}|y_i^{t-1}, \alpha_i\right) = 0$, for all $j = 1, ..., t-1$.
10.10.1. Inconsistency of the LSDV Estimator. Since $y_{-1i}$ is not strictly exogenous, the LSDV estimator, $\gamma_{LSDV}$, is inconsistent for $N \to \infty$. Nickell (1981) was the first to derive the inconsistency. Given
$$\gamma_{LSDV} = \gamma + \frac{\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right)}{\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)^2},$$
he showed that
$$\mathrm{plim}\,\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right) = E\left(\frac{1}{T}\sum_{t=1}^{T}\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right)\right) = -\frac{1}{T^2}\,\frac{T - 1 - T\gamma + \gamma^T}{(1-\gamma)^2}\,\sigma_\epsilon^2 \neq 0.$$
Hence, the bias vanishes for $T \to \infty$, but it does not for $N \to \infty$ and $T$ fixed. For this reason, the LSDV estimator is inaccurate in panel data sets with large $N$ and small $T$ and is said to be semi-inconsistent (see also Sevestre and Trognon, 1996).
Since Nickell (1981) a number of consistent IV and GMM estimators have been proposed in the econometric literature as an alternative to LSDV. Anderson and Hsiao (1981) suggest two simple IV estimators that, upon transforming the model in first differences to eliminate the unobserved individual heterogeneity, use the second lags of the dependent variable, either differenced or in levels, as an instrument for the differenced one-time lagged dependent variable. Arellano and Bond (1991) propose a GMM estimator for the first-differenced model which, relying on all available lags of $y_{-1i}$ as instruments, is more efficient than Anderson and Hsiao's. Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses only linear moment restrictions, suggest a set of non-linear restrictions that may be used in addition to the linear ones to obtain more efficient estimates. Blundell and Bond (1998) observe that with highly persistent data first-differenced IV or GMM estimators may suffer from a severe small-sample bias due to weak instruments. As a solution, they suggest a system GMM estimator with first-differenced instruments for the equation in levels and instruments in levels for the first-differenced equation. Some of the foregoing methods are nowadays very popular and are surveyed below.
10.10.2. The Anderson and Hsiao IV Estimator. One typical solution is to take model (10.10.2) in first differences to eliminate the individual effects:
(10.10.3) $y_{it} - y_{i,t-1} = \gamma(y_{i,t-1} - y_{i,t-2}) + \epsilon_{it} - \epsilon_{i,t-1}$.
This makes the disturbances MA(1) with a unit root, and so induces correlation between the lagged endogenous variable and the disturbances. This problem can be solved by using instruments for $\triangle y_{i,t-1}$. Anderson and Hsiao (1982) suggest using either $y_{i,t-2}$ or $\triangle y_{i,t-2}$, since these are correlated with $\triangle y_{i,t-1}$ but are uncorrelated with $\epsilon_{it} - \epsilon_{i,t-1}$. It is an exactly identified IV estimator, consistent under Assumption A.1, but generally non-optimal and with a high root mean squared error in applications.
10.10.3. The Arellano and Bond GMM estimator. Arellano and Bond (1991) propose a more efficient estimator, using a larger set of instruments. Note that, given (10.10.2), $y_{i1} = f(y_{i0}, \epsilon_{i1})$, $y_{i2} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2})$, ..., $y_{it} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{it})$, and that after differencing the first usable period in the sample is $t = 2$:
$$y_{i2} - y_{i1} = \gamma(y_{i1} - y_{i0}) + \epsilon_{i2} - \epsilon_{i1}.$$
One finds that the value $y_{i0}$ is a valid instrument for $(y_{i1} - y_{i0})$: in fact, $y_{i0}$ is correlated with $(y_{i1} - y_{i0})$ and, given Assumption A.1, $E\left[y_{i0}(\epsilon_{i2} - \epsilon_{i1})\right] = 0$.
In the next period, $t = 3$, we have
$$y_{i3} - y_{i2} = \gamma(y_{i2} - y_{i1}) + \epsilon_{i3} - \epsilon_{i2},$$
and both $y_{i0}$ and $y_{i1}$ are valid instruments for $(y_{i2} - y_{i1})$, and so on.
This approach adds an extra valid instrument with each forward period, so that in the last period $T$ the instrument set is $(y_{i0}, y_{i1}, y_{i2}, ..., y_{i,T-2})$. For each individual $i$, the matrix of instruments is therefore
$$Z_i = \begin{pmatrix} y_{i0} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & y_{i0} & y_{i1} & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & y_{i0} & y_{i1} & y_{i2} & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & & & & & & \ddots & & & & & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & \cdots & y_{i0} & y_{i1} & y_{i2} & \cdots & y_{i,T-2} \end{pmatrix}.$$
Stacking individuals, the overall matrix of instrumental variables is
$$Z = \left[Z_1', Z_2', ..., Z_N'\right]',$$
and model (10.10.3) can be reformulated more compactly as
$$\triangle y = \gamma\triangle y_{-1} + \triangle\epsilon,$$
where
• $\triangle y$ is an $(N(T-1) \times 1)$ vector;
• $\triangle y_{-1}$ is an $(N(T-1) \times 1)$ vector;
• $\triangle\epsilon$ is an $(N(T-1) \times 1)$ vector.
The number of instruments is $L = T(T-1)/2$. So, $Z$ is an $(N(T-1) \times L)$ matrix. The instrumental variables satisfy, for each individual $i$, the $(L \times 1)$ vector of population moment conditions $m(\gamma) \equiv E\left[Z_i'\triangle\epsilon_i\right] = 0$, where $\triangle\epsilon_i$ stands for the $i$-th block of $\triangle\epsilon$. We can define the $(L \times 1)$ vector of sample moment conditions $m(\hat{\gamma}) = \frac{1}{N}Z'(\triangle\epsilon) = \frac{1}{N}Z'(\triangle y - \hat{\gamma}\triangle y_{-1})$. Since $L > 1$, this is an over-identified case and GMM estimation is needed.
If we also consider Assumption A.2, that is, $\epsilon$, beyond being not serially correlated, is also homoskedastic, the optimal GMM estimator can be obtained in one step. It minimizes the following criterion function
(10.10.4) $Q(\hat{\gamma}) = m(\hat{\gamma})'A\,m(\hat{\gamma})$,
where, according to what we have seen in Subsection 10.4.1, $A$ is a consistent estimator of the inverse of
(10.10.5) $Var\left(Z_i'\triangle\epsilon_i\right) = \sigma^2 E\left(Z_i'GZ_i\right)$
up to an irrelevant positive scalar, and
$$G = \begin{pmatrix} 2 & -1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 2 & -1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 2 & -1 & \cdots & \vdots & \vdots \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 2 & -1 \\ 0 & 0 & 0 & 0 & \cdots & -1 & 2 \end{pmatrix}.$$
The Arellano-Bond one-step estimator $\hat{\gamma}_1$ uses
$$A = \left(\frac{1}{N}\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}$$
in (10.10.4), and so
$$\hat{\gamma}_1 = \left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}Z'\triangle y_{-1}\right]^{-1}\times\left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}Z'\triangle y\right].$$
Without homoskedasticity (that is, without Assumption A.2), $\hat{\gamma}_1$ is no longer optimal, but it remains consistent and as such can be used to construct the optimal two-step estimator $\hat{\gamma}_2$ along the lines described in Subsection 10.4.1. Therefore, $\hat{\gamma}_2$ minimizes (10.10.4) with
$$A = \left(\frac{1}{N}\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1},$$
where $\triangle e_{1i} = \triangle y_i - \hat{\gamma}_1\triangle y_{-1i}$ is the individual-level residual vector from the one-step estimator:
$$\hat{\gamma}_2 = \left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1}Z'\triangle y_{-1}\right]^{-1}\times\left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1}Z'\triangle y\right].$$
If the $\epsilon_{it}$ are iid$(0,\sigma_\epsilon^2)$, $\hat{\gamma}_1$ and $\hat{\gamma}_2$ are asymptotically equivalent.
To test instrument validity one can apply the Hansen-Sargan test of overidentifying restrictions:
$$S = \left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\right)'\left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\triangle e_{2i}'Z_i\right)^{-1}\left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\right),$$
where $\triangle e_{2i} = \triangle y_i - \hat{\gamma}_2\triangle y_{-1i}$ are the individual-level residuals from the two-step estimator. Under $H_0$: $E\left[Z_i'\triangle\epsilon_i\right] = 0$ for all $i = 1, ..., N$, $S \sim_A \chi^2_{L-1}$.
A second specification test suggested by Arellano and Bond (1991) tests for lack of AR(2) correlation in $\triangle e_1$ or $\triangle e_2$, which must hold under Assumption A.1. The AR(2) test statistic under the null has a limiting standard normal distribution.
10.10.3.1. Inference issues. Monte Carlo studies tend to show that estimated standard errors from 2-step GMM estimators are severely downward biased in finite samples (Arellano and Bond 1991). This is not the case for 1-step GMM standard errors, which instead are virtually unbiased. A possible explanation for this finding is that the weighting matrix in 2-step GMM estimators depends on estimated parameters, whereas that in 1-step GMM estimators does not. Windmeijer (2005) proves that in fact a large portion of the finite-sample bias of 2-step GMM standard errors is due to the variation of estimated parameters in the weighting matrix. He derives both a general bias correction and a specific one for panel data models with predetermined regressors as in the Arellano and Bond model.
Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full instrument set has zero power when T, and consequently the number of moment conditions, becomes too large for given N.
10.10.4. Blundell and Bond (1998) System estimator. Blundell and Bond (1998) demonstrate that when $\gamma$ is close to unity, instruments in levels are weakly correlated with $\triangle y_{-1}$, leading to what is known in the econometric literature as a weak-instrument bias. This is easily seen by considering the following example taken from Blundell and Bond. Let $T = 2$; then after taking the model in first differences there is only a cross-section available for estimation:
$$\triangle y_{i,2} = \gamma\triangle y_{i,1} + \triangle\epsilon_{i,2}, \quad i = 1, ..., N,$$
and only one moment condition
$$\frac{1}{N}\sum_{i=1}^{N}\left(\triangle y_{i,2} - \gamma\triangle y_{i,1}\right)y_{i,0} = 0.$$
To what extent is $y_{i,0}$ related to $\triangle y_{i,1}$? To answer this question it suffices to work out the reduced form for $\triangle y_{i,1}$:
$$\triangle y_{i,1} = (\gamma - 1)y_{i,0} + \alpha_i + \epsilon_{i,1},$$
from which it is clear that the closer $\gamma$ is to unity, the weaker the correlation between $y_{i,0}$ and $\triangle y_{i,1}$.
To solve the problem they suggest exploiting the following additional moment restrictions
(10.10.6) $E\left[\left(y_{i,t} - \gamma y_{i,t-1}\right)\triangle y_{i,t-1}\right] = 0, \quad t = 2, ..., T$,
which are valid if, along with Assumption A.1, we maintain that the process for $y_{i,t}$ is mean-stationary, that is,
A.4: $E(y_{i,0}|\alpha_i) = \dfrac{\alpha_i}{1-\gamma}$.
Assumption A.4 is justified if the process started in the distant past. Starting from the model at observation $t = 0$ and going backward in time recursively,
$$y_{i,0} = \gamma y_{i,-1} + \alpha_i + \epsilon_{i,0} = \gamma^2 y_{i,-2} + \gamma\alpha_i + \alpha_i + \gamma\epsilon_{i,-1} + \epsilon_{i,0} = \gamma^3 y_{i,-3} + \gamma^2\alpha_i + \gamma\alpha_i + \alpha_i + \gamma^2\epsilon_{i,-2} + \gamma\epsilon_{i,-1} + \epsilon_{i,0} = \cdots = \frac{\alpha_i}{1-\gamma} + \sum_{t=0}^{\infty}\gamma^t\epsilon_{i,-t} \equiv \frac{\alpha_i}{1-\gamma} + u_{i,0},$$
where $E(u_{i,0}|\alpha_i) = 0$ by Assumption A.1.
That the moment restrictions hold under Assumptions A.1 and A.4 can be seen for $t = 2$:
$$E\left[\left(y_{i,2} - \gamma y_{i,1}\right)\triangle y_{i,1}\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)y_{i,0} + \alpha_i + \epsilon_{i,1}\right]\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)\left(\frac{\alpha_i}{1-\gamma} + u_{i,0}\right) + \alpha_i + \epsilon_{i,1}\right]\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)u_{i,0} + \epsilon_{i,1}\right]\right] = 0.$$
Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instru-
ments in first differences for the equation in levels.
Hahn (1999) evaluates the efficiency gains brought by exploiting the stationarity of the
initial condition as done by Blundell and Bond, finding that it is substantial also for large T .
Stata’s xtabond performs the Arellano and Bond GMM estimator. Then, there is xtdpdsys,
which implements the GMM system estimator. Third, xtdpd, is a more general command that
allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2
(Roodman 2009) is certainly the most powerful code in Stata to implement dynamic panel
data models.
10.10.5. Application. Arellano and Bond (1991) illustrate their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler versions of their model through differenced and system GMM using xtabond and xtabond2. The do-file abbest.do by D. M. Roodman replicates exactly Arellano and Bond's results using xtabond2.
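As a flavour of what those do-files contain, here is a minimal sketch for a deliberately simplified AR(1) employment equation in abdata.dta (n is log employment; the lag structure and options are illustrative assumptions, not Arellano and Bond's exact specification):

use abdata, clear
xtset id year
xtabond n, lags(1) vce(robust)         // one-step differenced GMM, robust standard errors
estat abond                            // AR(1) and AR(2) tests on the differenced residuals
xtabond n, lags(1) twostep             // two-step estimates (non-robust VCE)
estat sargan                           // Sargan test of the overidentifying restrictions
xtabond2 n L.n, gmm(L.n) robust small  // Blundell-Bond system GMM (user-written command)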
10.10.6. Bias corrected LSDV. IV and GMM estimators in dynamic panel data models are consistent for N large, so they can be severely biased and imprecise in panel data with a small number of cross-sectional units. This certainly applies to most macro panels, but also to micro panels where heterogeneity concerns force the researcher to restrict estimation to small subsamples of individuals.
Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999) demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV and GMM estimators. So, an alternative approach based upon the correction of LSDV for the finite-sample bias has recently become popular in the econometric literature. Kiviet (1995) uses higher-order asymptotic expansion techniques to approximate the small-sample bias of the LSDV estimator, including terms of at most order 1/(TN). Monte Carlo evidence therein shows that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence by Judson and Owen (1999) strongly supports LSDVC when N is small, as in most macro panels. In Kiviet (1999) the bias approximation is made more accurate by including higher-order terms. Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).
Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the user-written Stata command xtlsdvc to implement LSDVC.
Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
10.10.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the
xtabond and xtabond2 applications of Subsection 10.10.5, here we do not use all information
available to estimate the parameters of the labour demand equation in abdata.dta. Instead,
we follow a strategy that, exploiting the industry partition of the cross-sectional dimension
as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily
accomplished by restricting the usable data to the panel of firms belonging to a given industry.
While such a strategy leads to a less restrictive specification for the firm labour demand, it reduces the number of cross-sectional units available for estimation, so that the researcher must be prepared to deal with a potentially severe small-sample bias in any of the industry regressions. Clearly, xtlsdvc is the appropriate solution in this case.
The demonstration is kept as simple as possible by considering regressions for only one industry panel, ind=4.
The following instructions are implemented in a Stata do-file.
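A minimal sketch of what such a do-file might contain follows; the regressor list, the options and the bootstrap repetitions are illustrative assumptions, not necessarily the exact specification of the original listing:

use abdata, clear
xtset id year
* Bias-corrected LSDV (LSDVC) for the dynamic labour demand of industry 4 only;
* xtlsdvc adds the lagged dependent variable automatically, initialises the
* correction with the Arellano-Bond estimator and bootstraps the standard errors.
xtlsdvc n w k if ind==4, initial(ab) bias(3) vcov(50)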
Part 2
Non-linear models
CHAPTER 11
Non-linear regression models
11.1. Introduction
Non-linear models present three main difficulties.
(1) Closed-form solutions for estimators are generally not available
(2) Marginal effects do not coincide with the model coefficients and vary over the sample.
(3) Latent heterogeneity components in cross-sections or panel data require special at-
tention.
The are two do-files demonstrating the methods of this chapter: nlmr.do using the data set
mus10data.dta and nlmr2.do using the data set mus17data.dta. Both data-sets are from
Cameron and Trivedi (2009).
11.2. Non-linear least squares
The regression model specifies the mean of $y$ conditional on a vector of exogenous explanatory variables $\mathbf{x}$ by using some known, non-linear functional form
$$E(y|\mathbf{x}) = \mu(\mathbf{x},\beta).$$
Or, equivalently,
$$y = \mu(\mathbf{x},\beta) + u,$$
where $u = y - E(y|\mathbf{x})$. The non-linear least squares estimator, $b_{NLS}$, minimizes the non-linear residual sum of squares
$$Q = \sum_{i=1}^{n}\left[y_i - \mu(\mathbf{x}_i, b)\right]^2.$$
11.3. Poisson model for count data
Let $y \in \mathbb{N}$ be a count variable: doctor visits, car accidents, etc. The Poisson regression model is a non-linear regression model with
(11.3.1) $E(y|\mathbf{x}) = \exp(\mathbf{x}'\beta)$.
Or, equivalently,
$$y = \exp(\mathbf{x}'\beta) + u,$$
where $u = y - E(y|\mathbf{x})$.
Equation (11.3.1) implies $E\left[y - \exp(\mathbf{x}'\beta)|\mathbf{x}\right] = 0$, and by the Law of Iterated Expectations there are zero covariances between $u$ and $\mathbf{x}$:
(11.3.2) $E_{y,\mathbf{x}}\left(\mathbf{x}\left[y - \exp(\mathbf{x}'\beta)\right]\right) = 0$.
11.3.1. Estimation. There is a random sample $\{y_i, \mathbf{x}_i\}$, $i = 1, ..., n$, for estimation. Given the population moment restrictions (11.3.2), estimation can be carried out with a limited set of assumptions within a GMM set-up: by the analogy principle the consistent GMM estimator $b_{GMM}$ solves the system of $k$ sample analog restrictions
(11.3.3) $\sum_{i=1}^{n}\mathbf{x}_i\left[y_i - \exp(\mathbf{x}_i'\beta)\right] = 0$.
Equations (11.3.3) also coincide with the first-order conditions of the Poisson ML estimator introduced below, so that $b_{ML} = b_{GMM}$ in this case.
Alternatively, we can maintain a Poisson density function for $y$ with mean $\mu$:
$$f(y) = \frac{e^{-\mu}\mu^{y}}{y!}.$$
Importantly, the Poisson model has the equidispersion property: $Var(y) = E(y) = \mu$.
Letting $\mu = \exp(\mathbf{x}'\beta)$ we end up with the conditional log-likelihood function
$$\ln L(y_1 ... y_n|\mathbf{x}_1 ... \mathbf{x}_n,\beta) = \sum_{i=1}^{n}\ln\left\{\frac{\exp\left[-\exp(\mathbf{x}_i'\beta)\right]\exp(\mathbf{x}_i'\beta)^{y_i}}{y_i!}\right\} = \sum_{i=1}^{n}\left[-\exp(\mathbf{x}_i'\beta) + y_i\mathbf{x}_i'\beta - \ln(y_i!)\right]$$
and the ML estimator $b_{ML}$ that maximizes it.
$b_{ML}$ is consistent: $b_{ML} \to_p \beta$. The covariance matrix estimator of $b_{ML}$ is
(11.3.4) $\hat{V}(b_{ML}) = \left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$.
It is easily seen that the $k$ first-order conditions that maximize $\ln L$ coincide with the equations in (11.3.3), so that $b_{ML} = b_{GMM}$. This proves two things: 1) the GMM estimator is asymptotically efficient if the conditional mean function is correctly specified and the density function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the correct density function, as long as the conditional mean is correctly specified. In such cases, when the likelihood function is not correctly specified, we refer to the ML estimator as a pseudo-ML estimator and a robust covariance matrix estimator should be used for inference rather than (11.3.4):
$$\hat{V}_{rob}(b_{ML}) = \left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left[\sum_{i=1}^{n}(y_i - \mu_i)^2\mathbf{x}_i\mathbf{x}_i'\right]\left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}.$$
With equidispersion, $Var(y|\mathbf{x}) = E(y|\mathbf{x}) = \mu$:
• $(y_i - \mu_i)^2$ is close to $\mu_i$, so $\hat{V}(b_{ML})$ is close to $\hat{V}_{rob}(b_{ML})$.
With overdispersion, $Var(y|\mathbf{x}) > E(y|\mathbf{x}) = \mu$:
• $(y_i - \mu_i)^2$ tends to be greater than $\mu_i$, so $\hat{V}(b_{ML})$ is inconsistent, with smaller variance estimates than $\hat{V}_{rob}(b_{ML})$, which remains consistent.
The consistency result for the (pseudo) ML estimator holds in general if two conditions are verified:
(1) The conditional mean is correctly specified.
(2) The density function belongs to an exponential family.
Definition 73. An exponential family of distributions is one whose conditional log-likelihood function at a generic observation is of the form
$$\ln L(y|\mathbf{x},\beta) = a(y) + b\left[\mu(\mathbf{x},\beta)\right] + y\,c\left[\mu(\mathbf{x},\beta)\right].$$
A member of the family is identified by the numerical values of $\beta$.
We verify that the Poisson is an exponential family:
• $a(y) = -\ln(y!)$,
• $b\left[\mu(\mathbf{x},\beta)\right] = -\exp(\mathbf{x}'\beta)$ and
• $y\,c\left[\mu(\mathbf{x},\beta)\right] = y\mathbf{x}'\beta$.
The Normal distribution with a known variance $\sigma^2$,
$$\phi(y|\mathbf{x},\beta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{\left[y - \mu(\mathbf{x},\beta)\right]^2}{2\sigma^2}\right\},$$
is an exponential family also:
• $a(y) = -\ln\left(\sigma\sqrt{2\pi}\right) - y^2/2\sigma^2$,
• $b\left[\mu(\mathbf{x},\beta)\right] = -\mu(\mathbf{x},\beta)^2/2\sigma^2$ and
• $y\,c\left[\mu(\mathbf{x},\beta)\right] = y\,\mu(\mathbf{x},\beta)/\sigma^2$.
The Stata command that implements Poisson regression is poisson, with a syntax close to regress. It computes $b_{ML}$ with standard error estimates obtained from $\hat{V}(b_{ML})$. If the vce(robust) option is given, then Stata recognizes the more robust pseudo-ML set-up and still provides the $b_{ML}$ coefficient estimates, but with the robust covariance matrix $\hat{V}_{rob}(b_{ML})$.
11.4. Modelling and testing overdispersion
We start from a specific Poisson density function, conditional on a random scalar $\nu$,
$$f(y|\nu) = \frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!},$$
with $E(\nu) = 1$ and $Var(\nu) = \sigma^2$. We can find the unconditional moments of $y$ by applying iterated expectations:
$$E(y) = E_\nu\left[E(y|\nu)\right] = \mu$$
and
$$Var(y) = E_\nu\left[Var(y|\nu)\right] + Var\left[E(y|\nu)\right] = E_\nu\left[\mu\nu\right] + Var\left[\mu\nu\right] = \mu + \mu^2\sigma^2 = \mu\left(1 + \mu\sigma^2\right) > \mu,$$
so overdispersion is allowed.
The marginal density function of $y$, $f(y)$, is what is needed for ML estimation, since $\nu$ is not observable. Its generic expression is
$$f(y) = E_\nu\left[\frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!}\right].$$
To find it in closed form we need to specify the marginal density function of $\nu$. If $\nu \sim Gamma(1,\alpha)$, then $f(y)$ is a negative binomial density function, $NB(\mu,\sigma^2)$, with $E(y) = \mu$ and $Var(y) = \mu(1 + \mu\sigma^2)$. Clearly, if $\sigma^2 = 0$, then $\nu$ collapses to its unity mean and $f(y)$ is Poisson.
Specifying $\mu = \exp(\mathbf{x}'\beta)$ yields the NB regression model, and $\beta$ and $\sigma^2$ are estimated via ML based on $NB\left[\exp(\mathbf{x}'\beta),\sigma^2\right]$. Testing for overdispersion within this framework boils down to testing the null hypothesis $\sigma^2 = 0$.
The Stata command that implements the NB regression is nbreg, with a syntax close to regress and poisson. The output also gives the overdispersion (LR) test of $\sigma^2 = 0$.
Overdispersion can also be tested under the null hypothesis of $\sigma^2 = 0$, therefore under Poisson regression, against the alternative of $Var(y|\mathbf{x}) = \mu(1 + \mu\sigma^2)$, therefore NB regression, using a Lagrange Multiplier test. This is based on an auxiliary regression implemented after poisson estimation using an estimate of $\left[Var(y|\mathbf{x})/\mu\right] - 1$, namely $\left[(y_i - \mu_i)^2 - y_i\right]/\mu_i$, as the dependent variable and $\mu_i = \exp(\mathbf{x}_i'b_{ML})$ as the only regressor (no constant). The LM test is the t-statistic computed for the OLS coefficient estimate on $\mu_i$.
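A minimal sketch of this auxiliary regression (continuing the hypothetical docvis example above):

quietly poisson docvis private chronic female income
predict mu, n                                   // fitted means exp(x'b)
generate lhs = ((docvis - mu)^2 - docvis)/mu    // estimate of Var(y|x)/mu - 1
regress lhs mu, noconstant                      // LM test = t-statistic on mu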
CHAPTER 12
Binary dependent variable models
12.1. Introduction
Binary dependent variable models have a dependent variable that partitions the sample into two categories of a given qualitative dimension of interest. For example:
• Labour supply. There are two categories: work/not work (univariate binary model).
• Supplementary private health insurance. There are two categories: purchase/not purchase (univariate binary model).
Binary models are said to be multivariate when there are multiple dimensions that are possibly related:
• Two related dimensions: [Dimension 1: Being overweight (body mass index > 25) $\Rightarrow$ two categories: yes/no] and [Dimension 2: Job satisfaction $\Rightarrow$ two categories: satisfied/dissatisfied] (bivariate binary model).
• Two related dimensions: [Dimension 1: Identity of immigrants with the host country $\Rightarrow$ two categories: yes/no] and [Dimension 2: Identity of immigrants with the country of origin $\Rightarrow$ two categories: yes/no] (bivariate binary model).
In these notes I focus almost exclusively on univariate binary models, except for a digression on the bivariate probit model as estimated by Stata's biprobit.
The do-file bdvm.do is a Stata application on binary models that uses the data set mus14data.dta from Cameron and Trivedi (2009).
12.2. Binary models
Let $A$ be the event of interest (e.g. "an immigrant identifies with the host-country culture"). Let the indicator function $1(A)$ be unity if event $A$ occurs and zero if not. Define the discrete random variable $y$ such that
(12.2.1) $y = 1(A)$.
Then
• $\Pr(y = 1) = \Pr(A) \equiv \rho$ and $\Pr(y = 0) = 1 - \rho$.
• $E(y) = \rho$ and $Var(y) = \rho(1-\rho)$.
We want to assess the impact of $\mathbf{x}$ on the probability of $A$ and to do so we model $\Pr(y = 1|\mathbf{x})$ as a function of $\mathbf{x}$.
Since $0 \leq \Pr(y = 1|\mathbf{x}) \leq 1$, a suitable functional form for $\Pr(y = 1|\mathbf{x})$ is any cumulative distribution function evaluated at a linear combination of $\mathbf{x}$, $F(\mathbf{x}'\beta)$. Accordingly, we specify
(12.2.2) $\Pr(y = 1|\mathbf{x}) = F(\mathbf{x}'\beta)$.
Two popular choices for $F(\cdot)$ are
• Probit model: $F(\cdot) \equiv \Phi(\cdot)$, the Standard Normal distribution;
• Logit model: $F(\cdot) \equiv \Lambda(\cdot) \equiv \exp(\mathbf{x}'\beta)/\left[1 + \exp(\mathbf{x}'\beta)\right]$, the Logistic distribution, with zero mean and variance $\pi^2/3$.
Alternatively, we may model $F(\cdot)$ directly as a linear function of $\mathbf{x}$:
• Linear Probability Model (LPM): $F(\mathbf{x}'\beta) \equiv \mathbf{x}'\beta$.
Since $\Pr(y = 1|\mathbf{x}) = E(y|\mathbf{x})$, Model (12.2.2) can always be expressed as the regression model
(12.2.3) $y = F(\mathbf{x}'\beta) + u$, with $u = y - E(y|\mathbf{x})$.
12.2.1. Latent regression. When $F(\cdot)$ is a distribution function the binary model can be motivated as a latent regression model. In microeconomics this is a convenient way to model individual choices.
Introduce the latent continuous random variable $y^{*}$ with
(12.2.4) $y^{*} = \mathbf{x}'\beta + \varepsilon$,
where $\varepsilon$ is a zero-mean random variable that is independent of $\mathbf{x}$ and with $\varepsilon \sim F$, where $F$ is a distribution function that is symmetric around zero. Then, let $y = 1(y^{*} > 0)$. In our example of immigrant identity we may think of $y^{*}$ as the utility variation faced by a subject with observable and latent characteristics $\mathbf{x}$ and $\varepsilon$, respectively, when he/she decides to conform to the host-country culture, so that event $A$ occurs if and only if $y^{*} > 0$. Then
(12.2.5) $y = 1\left(\varepsilon > -\mathbf{x}'\beta\right)$
$$\Rightarrow \Pr(y = 1|\mathbf{x}) = \Pr\left(\varepsilon > -\mathbf{x}'\beta\,|\,\mathbf{x}\right).$$
Since $\varepsilon$ and $\mathbf{x}$ are independent, $\Pr(\varepsilon \leq \mathbf{x}'\beta|\mathbf{x}) = F(\mathbf{x}'\beta)$. Moreover, by symmetry of $F$, $\Pr(\varepsilon > -\mathbf{x}'\beta|\mathbf{x}) = F(\mathbf{x}'\beta)$ and so
$$\Pr(y = 1|\mathbf{x}) = F(\mathbf{x}'\beta),$$
which is exactly Model (12.2.2).
Inspection of (12.2.5) clarifies that $Var(\varepsilon) = \sigma^2$ and $\beta$ cannot be separately identified, since $\Pr(\varepsilon > -\mathbf{x}'\beta) = \Pr\left[(\varepsilon/\sigma) > -\mathbf{x}'(\beta/\sigma)\right]$. Therefore, to identify $\beta$, $\sigma^2$ must be fixed to some known value. In the probit model $\sigma^2 = 1$ and in the logit model $\sigma^2 = \pi^2/3$.
12.2.2. Estimation. There is a random sample $\{y_i, \mathbf{x}_i\}$, $i = 1, ..., n$, for estimation. In the logit and probit models estimation is carried out via ML. The ML estimator, $b_{ML}$, maximizes the conditional log-likelihood function
$$\ln L(y_1 ... y_n|\mathbf{x}_1 ... \mathbf{x}_n,\beta) = \sum_{i=1}^{n}\left\{y_i\ln\left[F(\mathbf{x}_i'\beta)\right] + (1 - y_i)\ln\left[1 - F(\mathbf{x}_i'\beta)\right]\right\}.$$
We have $b_{ML} \to_p \beta$ and
(12.2.6) $\hat{V}(b_{ML}) = \left\{\sum_{i=1}^{n}\dfrac{f(\mathbf{x}_i'b_{ML})^2\,\mathbf{x}_i\mathbf{x}_i'}{F(\mathbf{x}_i'b_{ML})\left[1 - F(\mathbf{x}_i'b_{ML})\right]}\right\}^{-1}$,
where $f$ is the density function of $F$ (remember that $\partial_x F(x) = f(x)$).
The Stata commands that compute $b_{ML}$ and $\hat{V}(b_{ML})$ in the probit and logit models are, respectively, probit and logit. The syntax is similar to regress.
The LPM assumes $F = X\beta$. So, Equation (12.2.3) is a linear regression model that can be estimated by regress. In this case the model coefficients are identical to the marginal effects of interest. But $Var(u|\mathbf{x}) = Var(y|\mathbf{x}) = \mathbf{x}'\beta(1 - \mathbf{x}'\beta)$, so the model is heteroskedastic and regress should be supplemented by the vce(robust) option.
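For instance, a minimal sketch in the spirit of bdvm.do; ins (an indicator for purchasing supplementary insurance) and the regressors are assumed to be variables of mus14data.dta and should be treated as placeholders:

use mus14data, clear
probit ins age hhincome educyear married
logit ins age hhincome educyear married
regress ins age hhincome educyear married, vce(robust)   // LPM with robust standard errors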
12.2.3. Heteroskedasticity. Unlike the non-linear models examined in Chapter 11, in probit and logit models heteroskedasticity brings about misspecification of the conditional mean, so that the ML estimators of both models become inconsistent. Hence, it makes little sense to complement probit and logit coefficient estimates with heteroskedasticity-robust standard error estimates.
Heteroskedasticity can be modeled, though. In the probit model, instead of fixing $\sigma^2 = 1$, one can allow heteroskedasticity by setting $\sigma_i = \exp(\mathbf{z}_i'\gamma)$, so that
(12.2.7) $\Pr(y_i = 1|\mathbf{x}) = \Phi\left(\mathbf{x}_i'\beta/\exp(\mathbf{z}_i'\gamma)\right)$.
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides an LR test for the null of homoskedasticity ($\gamma = 0$).
12.2.4. Clustering. Differently from heteroskedasticity, it makes sense to adjust standard error estimates for within-cluster correlation. This is the case since within-cluster correlation leaves the conditional expectation of an individual observation unaffected, so that the ML estimator can be motivated as a partial ML estimator, which remains consistent even if observations are not independent (see Wooldridge 2010, p. 609). The Stata option vce(cluster clustervar), therefore, can be conveniently included in both probit and logit statements.
12.3. Coefficient estimates and marginal effects
There is no exact relationship between the coefficient estimates from the three foregoing models. Amemiya (1981) works out the following rough conversion factors:
$$b_{logit} \simeq 4\,b_{ols}, \qquad b_{probit} \simeq 2.5\,b_{ols}, \qquad b_{logit} \simeq 1.6\,b_{probit}.$$
The marginal effects of $\mathbf{x}$ at observation $i$ are estimated by probit and logit as
$$(\partial_{\mathbf{x}}F_i)_{probit} = f_{probit,i}\,b_{probit} = \phi\left(\mathbf{x}_i'b_{probit}\right)b_{probit},$$
$$(\partial_{\mathbf{x}}F_i)_{logit} = f_{logit,i}\,b_{logit} = \Lambda\left(\mathbf{x}_i'b_{logit}\right)\left[1 - \Lambda\left(\mathbf{x}_i'b_{logit}\right)\right]b_{logit},$$
and by the LPM as
$$(\partial_{\mathbf{x}}F_i)_{ols} = b_{ols}.$$
The following relationships hold:
$$(\partial_{\mathbf{x}}F)_{logit} \leq 0.25\,b_{logit} \simeq b_{ols}, \qquad (\partial_{\mathbf{x}}F)_{probit} \leq 0.4\,b_{probit} \simeq b_{ols}.$$
The post-estimation command margins with the option dydx(varlist) estimates marginal effects for each of the variables in varlist. Marginal effects can be estimated at a point $\mathbf{x}$ (conventionally, the sample mean when variables are continuous, in which case the option atmeans must be supplied) or can be averaged over the sample (the default).
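For instance, continuing the hypothetical probit specification above:

quietly probit ins age hhincome educyear married
margins, dydx(*)            // marginal effects averaged over the sample (the default)
margins, dydx(*) atmeans    // marginal effects at the sample means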
12.4. Tests and Goodness of fit measures
Parameter restrictions can be tested by Wald tests (test) and LR tests (lrtest). As explained above, hetprobit, besides producing coefficient estimates, provides a heteroskedasticity test.
The most common goodness-of-fit measures reported in logit or probit outputs are:
• The overall percent correctly predicted (opcp). Define the predictor $\hat{y}_i$ of $y_i$ as
$$\hat{y}_i = \begin{cases} 1 & \text{if } F\left(\mathbf{x}_i'\hat{\beta}\right) \geq 0.5 \\ 0 & \text{else.} \end{cases}$$
The opcp is given by the number of times $\hat{y}_i = y_i$ over $n$. A problem with this measure is that it can be high also in cases where the model poorly predicts one outcome. It may be more informative in these cases to compute the percent correctly predicted for each outcome separately: 1) the number of times $\hat{y}_i = y_i = 1$ over the number of times $y_i = 1$, and 2) the number of times $\hat{y}_i = y_i = 0$ over the number of times $y_i = 0$. This is done through the post-estimation command estat classification.
• Test the discrepancy between the actual frequency of an outcome and the estimated average probability of the same outcome within a subsample $S$ of interest (for example, females in a sample of workers):
$$\bar{y}_S \equiv \frac{1}{n_S}\sum_{i \in S} y_i \quad \text{vs.} \quad \bar{p}_S \equiv \frac{1}{n_S}\sum_{i \in S} F\left(\mathbf{x}_i'\hat{\beta}\right).$$
Doing this on the whole sample makes little sense because the two measures are always very close (they are equal in the logit model with an intercept).
• Evaluate the pseudo R-squared: $\tilde{R}^2 = 1 - L(\hat{\beta})/L(\bar{y})$, where $L(\hat{\beta})$ is the value of the maximized log-likelihood and $L(\bar{y})$ is the log-likelihood evaluated for the model with only the intercept. Always $0 < \tilde{R}^2 < 1$.
12.5. Endogenous regressors
In the presence of a continuous endogenous regressor in the latent regression model one can use an instrumental variable probit estimator. This is implemented by ivprobit, with a syntax similar to ivregress. When the endogenous regressor is binary, we can apply a bivariate recursive probit model, as explained in Subsection 12.7.1.
12.6. Independent latent heterogeneity
In the latent regression model (12.2.4) all explanatory variables are observed. But it may be the case that relevant explanatory variables are latent, as allowed by the following specification of the model
$$y^{*} = \mathbf{x}'\beta + \mathbf{w}'\beta_w + \varepsilon,$$
where the $\mathbf{w}$'s are latent variables. There is thus a latent heterogeneity component $\alpha \equiv \mathbf{w}'\beta_w$ to consider in the model along with $\varepsilon$. We make the following assumption:
• $\alpha|\mathbf{x},\varepsilon \sim N(0,\sigma^2)$.
Then $\alpha + \varepsilon|\mathbf{x} \sim N(0, 1 + \sigma^2)$ and
$$\frac{y^{*}}{\sqrt{1+\sigma^2}} = \mathbf{x}'\frac{\beta}{\sqrt{1+\sigma^2}} + \frac{\alpha + \varepsilon}{\sqrt{1+\sigma^2}}$$
is a legitimate probit model. In fact, $y^{*}/\sqrt{1+\sigma^2}$ is latent,
$$\frac{\alpha + \varepsilon}{\sqrt{1+\sigma^2}}\Big|\,\mathbf{x} \sim N(0,1)$$
and so
$$\Phi\left(\mathbf{x}'\frac{\beta}{\sqrt{1+\sigma^2}}\right) = \Pr(y = 1|\mathbf{x}).$$
It follows that we can apply standard probit ML estimation and the resulting estimator\�/p1 + �2 is consistent for �/
p1 + �2 and so is �
✓
x
0 \�p1+�2
◆
for the response probabilities
Pr (y = 1|x) .
From the above analysis it clearly emerges that \�/p1 + �2 estimates � with a downward
bias (Yatchew and Griliches (1985)). Nonetheless, if our interest centers on marginal effects
@xPr (y|↵,x) averaged over ↵ (AME’s), E↵ [@xPr (y|↵,x)], this is no problem.
Indeed, given f (↵|x) the conditional density function of ↵, it is generally true that
Pr (y|x) =ˆ
↵|x
Pr (y|x,↵) f (↵|x) d↵
But since ↵ and x are independent f (↵|x) = f (↵) and so
Pr (y|x) =ˆ↵
Pr (y|x,↵) f (↵) d↵ = E↵ [Pr (y|↵,x)]
Hence, under mild regularity conditions that permit interchanging integrals and derivatives,
@xPr (y|x) = E↵ [@xPr (y|↵,x)]
The above result is important, for it establishes that to estimate $\Pr(y|\mathbf{x})$ and $\partial_{\mathbf{x}}\Pr(y|\mathbf{x})$ is to estimate $E_\alpha\left[\Pr(y|\alpha,\mathbf{x})\right]$ and $E_\alpha\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$, respectively. So $\Phi\left(\mathbf{x}'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ is a consistent estimator of $E_\alpha\left[\Pr(y|\alpha,\mathbf{x})\right]$; likewise, $\partial_{\mathbf{x}}\Phi\left(\mathbf{x}'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ is a consistent estimator of $E_\alpha\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$ (see Wooldridge (2005) and Wooldridge (2010)).
If evaluated at a given point $\mathbf{x}_0$, the AMEs are averages over $\alpha$ alone. To estimate $E_{\mathbf{x},\alpha}\left[\Pr(y|\alpha,\mathbf{x})\right]$ and $E_{\mathbf{x},\alpha}\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$, just average $\Phi\left(\mathbf{x}_i'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ and $\partial_{\mathbf{x}}\Phi\left(\mathbf{x}_i'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ over the sample.
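A small simulation can illustrate both points: the probit coefficient is attenuated when the latent heterogeneity is ignored, while the average marginal effect is still estimated consistently. This is only a sketch under assumed parameter values (none of the numbers come from these notes): with $\beta = 1$ and $\sigma^2 = 3$, the probit of $y$ on $x$ should recover $1/\sqrt{1+3} = 0.5$.

* DGP: y* = 1 + x + alpha + e, with alpha ~ N(0,3) independent of x and e, e ~ N(0,1)
clear
set seed 12345
set obs 50000
generate x     = rnormal()
generate alpha = sqrt(3)*rnormal()
generate e     = rnormal()
generate y     = (1 + x + alpha + e > 0)
probit y x          // slope estimate close to 0.5, not to the structural value 1
margins, dydx(x)    // consistent for the average marginal effect E[dPr(y=1|alpha,x)/dx]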
12.7. Multivariate probit models
Multivariate probit models can be conveniently represented using the latent-regression
framework. There are $m$ binary variables, $y_1, y_2, \ldots, y_m$, that may be related.
Multivariate probit models are constructed by supplementing the random vector $\mathbf{y}$ defined in (12.2.1) with the latent regression model
$$(12.7.1)\qquad y_j^* = \mathbf{x}'\beta_j + \varepsilon_j,$$
$j = 1, \ldots, m$, where $\beta_j$, $\mathbf{x}$ and $\varepsilon_j$ are, respectively, the $p\times 1$ vectors of parameters and explanatory variables and the error term. Stacking all $\varepsilon_j$'s into the vector $\varepsilon \equiv (\varepsilon_1, \ldots, \varepsilon_m)'$, we assume $\varepsilon|\mathbf{x} \sim N(0,R)$. The covariance matrix $R$ is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing $\beta_j$ to have zeroes in the positions of the variables in $\mathbf{x}$ that are excluded from equation $j$. Cross-equation restrictions on the $\beta$'s are also permitted. $R$ is normalized for scale and so has unit diagonal elements and arbitrary off-diagonal elements, $\rho_{ij}$, which allows for possible cross-equation correlation of the errors. It may or may not be subject to constraints beyond normalization. If $m = 2$ we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to that of probit.
12.7.1. Recursive models. An interesting class of multivariate probit models is that of the recursive models. In recursive probit models the variables in $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ are allowed as right-hand-side variables of the latent system, provided that the $m\times m$ matrix of coefficients on $\mathbf{y}$ is restricted to be triangular (Roodman 2011). This means that if the model is bivariate, the latent system is
$$y_1^* = \mathbf{x}'\beta_1 + \gamma y_2 + \varepsilon_1$$
$$(12.7.2)\qquad y_2^* = \mathbf{x}'\beta_2 + \varepsilon_2$$
It is then evident that estimating a bivariate recursive probit model subsumes the estimation of a univariate probit model with a binary endogenous regressor, which is the first equation of system (12.7.2).
The feature that makes the recursive multivariate probit model appealing is that it accom-
modates endogenous, binary explanatory variables without special provisions for endogeneity,
simply maximizing the log-likelihood function as if the explanatory variables were all ordinary
exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof, Roodman 2011). This can be easily seen here in the case of the recursive bivariate model:
$$\Pr(y_1=1, y_2=1|\mathbf{x}) = \Pr(y_1=1|y_2=1,\mathbf{x})\Pr(y_2=1|\mathbf{x})$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma\,|\,y_2=1,\mathbf{x}\right]\Pr\left[y_2=1|\mathbf{x}\right]$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma\,|\,\varepsilon_2 > -\mathbf{x}'\beta_2,\mathbf{x}\right]\Pr\left[\varepsilon_2 > -\mathbf{x}'\beta_2|\mathbf{x}\right]$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma,\ \varepsilon_2 > -\mathbf{x}'\beta_2\,|\,\mathbf{x}\right]$$
$$= \Phi_2\left(\mathbf{x}'\beta_1 + \gamma,\ \mathbf{x}'\beta_2\right).$$
The crux of the above derivations is that, given
$$y_1 = \mathbf{1}\left(\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma y_2\right) \quad\text{and}\quad y_2 = \mathbf{1}\left(\varepsilon_2 > -\mathbf{x}'\beta_2\right),$$
$\varepsilon_1$ is independent of the lower limit of integration conditional on $\varepsilon_2 > -\mathbf{x}'\beta_2$, and so no endogeneity issue emerges when working out the joint probability as a joint normal distribution. The other three joint probabilities are derived similarly, so that eventually the likelihood function is assembled exactly as in a conventional multivariate probit model.¹
Starting with the contributions of Evans and Schwab (1995) and Greene (1998), there
are by now many econometric applications of this model, including the recent articles by
Fichera and Sutton (2011) and Entorf (2012). The user-written command mvprobit deals with $m > 2$; it evaluates the multiple integrals by simulation (see Cappellari and Jenkins (2003)).
¹ Wooldridge (2010) argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (12.7.2) substantial identification requires some zeroes in $\beta_1$, where the corresponding variables may then be thought of as instruments for $y_2$.
The recent user-written command cmp (see Roodman (2011)) is a more general simulation-based procedure that can estimate many multiple-response and multivariate models.
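A minimal sketch of the recursive bivariate probit in Stata, with hypothetical variable names (y1 outcome, y2 binary endogenous regressor, x1 x2 controls, z1 an excluded instrument in the spirit of footnote 1); the cmp lines reproduce my reading of Roodman's syntax and should be checked against its help file:

* hypothetical variables: y1 outcome, y2 binary endogenous, x1 x2 controls, z1 excluded instrument
biprobit (y1 = y2 x1 x2) (y2 = x1 x2 z1)
* the test of rho = 0 reported at the foot of the biprobit output gauges the endogeneity of y2
* the same model with the user-written cmp (ssc install cmp):
* cmp setup
* cmp (y1 = y2 x1 x2) (y2 = x1 x2 z1), indicators($cmp_probit $cmp_probit)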
CHAPTER 13
Censored and selection models
13.1. Introduction
• Censored models (Tobit models): y has lower and/or upper limits
• Selection models: some values of y are missing not at random.
13.2. Tobit models
Consider the latent regression model
$$y^* = \mathbf{x}'\beta + \varepsilon,$$
with $\varepsilon|\mathbf{x} \sim N\left(0,\sigma^2\right)$. $y$ is an observed random variable such that
$$y = \begin{cases} y^* & \text{if } y^* > L \\ L & \text{if } y^* \leq L \end{cases}$$
where $L$ is a known lower limit.
Think of a utility-maximizing individual with latent and observable characteristics $\varepsilon$ and $\mathbf{x}$, respectively, choosing $y$ subject to the inequality constraint $y \geq L$, with $y^*$ denoting the unconstrained solution. For some individuals the constraint is binding ($y = L$) and for the others it is not ($y > L$). Focussing on the latter subpopulation, the regression model is
$$y = E(y|\mathbf{x}, y > L) + u$$
$$= E\left(\mathbf{x}'\beta + \varepsilon\,|\,\mathbf{x}, \varepsilon > L - \mathbf{x}'\beta\right) + u$$
$$(13.2.1)\qquad = \mathbf{x}'\beta + E\left(\varepsilon\,|\,\mathbf{x}, \varepsilon > L - \mathbf{x}'\beta\right) + u$$
where $u = y - E(y|\mathbf{x}, y > L)$. The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874-876):
$$f(z|z>\alpha) = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right)\Big/\left\{1-\Phi\left[(\alpha-\mu)/\sigma\right]\right\}$$
$$f(z|z<\alpha) = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right)\Big/\Phi\left[(\alpha-\mu)/\sigma\right]$$
$$E(z|z>\alpha) = \mu + \sigma\,\frac{\phi\left[(\alpha-\mu)/\sigma\right]}{1-\Phi\left[(\alpha-\mu)/\sigma\right]}$$
$$E(z|z<\alpha) = \mu - \sigma\,\frac{\phi\left[(\alpha-\mu)/\sigma\right]}{\Phi\left[(\alpha-\mu)/\sigma\right]}.$$
The foregoing equalities are all based on the following representations of a general cumulative distribution function, $F_{(\mu,\sigma^2)}$:
$$F_{(\mu,\sigma^2)}(\alpha) = \Pr\left[(z-\mu)/\sigma \leq (\alpha-\mu)/\sigma\right] = F_{(0,1)}\left[(\alpha-\mu)/\sigma\right]$$
and of a general normal density, $\phi_{(\mu,\sigma^2)}(z)$:
$$\phi_{(\mu,\sigma^2)}(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(z-\mu)^2}{2\sigma^2}\right] = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right).$$
Then, Model (13.2.1) can be written in closed form as
$$y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left[(L-\mathbf{x}'\beta)/\sigma\right]}{1-\Phi\left[(L-\mathbf{x}'\beta)/\sigma\right]} + u.$$
By symmetry of the normal distribution,
$$(13.2.2)\qquad y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left[(\mathbf{x}'\beta-L)/\sigma\right]}{\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]} + u.$$
If $L = 0$, the foregoing reduces to
$$y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left(\mathbf{x}'\beta/\sigma\right)}{\Phi\left(\mathbf{x}'\beta/\sigma\right)} + u.$$
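Before turning to estimation, the truncated-mean formula above can be checked numerically. A minimal simulation sketch (the values $\mu = 2$, $\sigma = 3$, $\alpha = 1$ are purely illustrative):

* draw z ~ N(mu, sigma^2) and compare the simulated mean of z | z > a with
* mu + sigma*phi(c)/(1 - Phi(c)), where c = (a - mu)/sigma
clear
set seed 2015
set obs 1000000
scalar mu    = 2
scalar sigma = 3
scalar a     = 1
generate z = scalar(mu) + scalar(sigma)*rnormal()
summarize z if z > scalar(a)
scalar c = (scalar(a) - scalar(mu))/scalar(sigma)
display "analytical truncated mean = " scalar(mu) + scalar(sigma)*normalden(scalar(c))/(1 - normal(scalar(c)))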
13.2.1. Estimation. There is a random sample $\{y_i,\mathbf{x}_i\}$, $i = 1, \ldots, n$, for estimation. Let $d_i = \mathbf{1}(y_i > L)$. Estimation can be via ML or two-step LS.
The log-likelihood function assembles the density functions for the subsample of individuals with $d_i = 1$ and those for the individuals with $d_i = 0$ (left-censored). For an individual with $d_i = 1$, $y_i = y_i^*$ and we know that $y_i^*|\mathbf{x}_i \sim N\left(\mathbf{x}_i'\beta,\sigma^2\right)$. Therefore, we can evaluate the density at the single point $y_i - \mathbf{x}_i'\beta$:
$$f(y_i|\mathbf{x}_i) = \frac{1}{\sigma}\phi\left(\frac{y_i-\mathbf{x}_i'\beta}{\sigma}\right).$$
For a left-censored individual ($d_i = 0$), all we know is that $\varepsilon_i \leq L - \mathbf{x}_i'\beta$, so we integrate the density over the interval $\varepsilon_i \leq L - \mathbf{x}_i'\beta$ to get $\Pr(y_i = L|\mathbf{x}_i) = \Phi\left[(L-\mathbf{x}_i'\beta)/\sigma\right]$. Therefore, the log-likelihood function is
$$\ln L\left(y_1 \ldots y_n|\mathbf{x}_1 \ldots \mathbf{x}_n,\beta\right) = \sum_{i=1}^{n}\left\{d_i\ln\left[\frac{1}{\sigma}\phi\left(\frac{y_i-\mathbf{x}_i'\beta}{\sigma}\right)\right] + (1-d_i)\ln\Phi\left[\frac{L-\mathbf{x}_i'\beta}{\sigma}\right]\right\}.$$
The ML estimator $b_{ML}$ is consistent for $\beta$, asymptotically normal and asymptotically efficient.
Two-step LS, $b_{2\text{-}step}$, is based on Equation (13.2.2). In the first step we run a probit regression using $d_i$ as the dependent variable to estimate $\phi\left[(\mathbf{x}'\beta-L)/\sigma\right]/\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]$ by $\widehat{\phi_i/\Phi_i} \equiv \phi\left(\mathbf{x}_i'b_{probit}\right)/\Phi\left(\mathbf{x}_i'b_{probit}\right)$ (recall that $b_{probit}$ is indeed a consistent estimate of $\beta/\sigma$, and $L/\sigma$ is subsumed in the constant term). In the second step we apply an OLS regression of $y_i$ on $\mathbf{x}_i$ and $\widehat{\phi_i/\Phi_i}$, restricted to the uncensored subsample $d_i = 1$. $b_{2\text{-}step}$ is consistent, but standard errors need to be adjusted since the second step includes an estimated regressor.
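A minimal Stata sketch of the two-step procedure with $L = 0$ and hypothetical variables y, x1, x2 (as noted above, the standard errors reported by the second-step regress ignore the estimated regressor and are therefore not valid as they stand):

* two-step LS for the tobit model with L = 0 (hypothetical variables y, x1, x2)
generate byte d = (y > 0)
probit d x1 x2                                   // step 1: estimates beta/sigma
predict xbhat, xb
generate mills = normalden(xbhat)/normal(xbhat)  // phi/Phi evaluated at x'b_probit
regress y x1 x2 mills if d == 1                  // step 2: OLS on the uncensored subsample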
Upper limits can be dealt with similarly:
$$y = \begin{cases} y^* & \text{if } y^* < U \\ U & \text{if } y^* \geq U. \end{cases}$$
Also, lower and upper limits jointly:
$$y = \begin{cases} L & \text{if } y^* \leq L \\ y^* & \text{if } L < y^* < U \\ U & \text{if } y^* \geq U. \end{cases}$$
The Stata command that computes $b_{ML}$ in the tobit model is tobit. The syntax is similar to that of regress, requiring in addition options specifying lower limits, ll(#), and upper limits, ul(#).
Marginal effects of interest are
• $\partial_{\mathbf{x}} E(y^*|\mathbf{x}) = \beta$
• $\partial_{\mathbf{x}} E(y|\mathbf{x}, y > L) = \beta\left[1 - w\lambda(w) - \lambda^2(w)\right]$, where $w = (\mathbf{x}'\beta - L)/\sigma$ and $\lambda(w) = \phi\left[(\mathbf{x}'\beta-L)/\sigma\right]/\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]$.
• $\partial_{\mathbf{x}} E(y|\mathbf{x}) = \Phi(w)\beta$, the effect on the censored mean.
These are computed by margins and the older mfx.
13.2.2. Heteroskedasticity and clustering. The same considerations made for binary models in Sections 12.2.3 and 12.2.4 hold here. While heteroskedasticity breaks down the specification of the conditional expectations, clustering does not. Therefore, it makes sense to apply the Stata option vce(cluster clustervar).
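A minimal Stata sketch pulling these pieces together, with hypothetical variables y (censored below at 0), x1, x2 and a cluster identifier id; the predict() options passed to margins follow my reading of the tobit post-estimation documentation and should be checked against the help file:

* hypothetical variables: y censored below at 0, x1 x2 regressors, id cluster identifier
tobit y x1 x2, ll(0) vce(cluster id)
margins, dydx(*)                         // effects on the latent mean E(y*|x), i.e. beta
margins, dydx(*) predict(ystar(0,.))     // effects on the censored mean E(y|x)
margins, dydx(*) predict(e(0,.))         // effects on the truncated mean E(y|x, y > 0)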
13.3. A simple selection model
Two processes: the first selects the units into the sample, the second generates $y$. The two processes are related: selection is endogenous and cannot be ignored!
Selection process:
$$s^* = \mathbf{z}'\gamma + \eta,$$
$$d = \mathbf{1}\left(s^* > 0\right).$$
Process for $y$:
$$y^* = \mathbf{x}'\beta + \varepsilon,$$
$$y = \begin{cases} y^* & \text{if } d = 1 \\ \text{missing} & \text{if } d = 0 \end{cases}$$
Interest is in $\beta$. The two processes are related, for $\varepsilon$ and $\eta$ are jointly distributed as
$$\begin{pmatrix} \eta \\ \varepsilon \end{pmatrix}\Big|\,\mathbf{z},\mathbf{x} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & \sigma_{\eta\varepsilon} \\ \sigma_{\eta\varepsilon} & \sigma_\varepsilon^2 \end{pmatrix}\right]$$
Estimation is via ML. The log-likelihood is
$$\ln L = \sum_{i=1}^{n}\left\{d_i\ln\left[f(y_i|d_i=1)\Pr(d_i=1)\right] + (1-d_i)\ln\left[\Pr(d_i=0)\right]\right\}.$$
The Stata command that computes $b_{ML}$ in the selection model is heckman. The syntax is similar to that of regress, requiring in addition an option specifying the list of variables in the selection process, $d$ and $\mathbf{z}$: select(varlist_s).
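A minimal sketch of the heckman syntax with hypothetical variables (y is observed only when d = 1; z1 and z2 appear in the selection equation only):

* hypothetical variables: y outcome (missing when d = 0), x1 x2 regressors,
* d selection indicator, z1 z2 selection-equation variables
heckman y x1 x2, select(d = x1 z1 z2)           // full-information ML
heckman y x1 x2, select(d = x1 z1 z2) twostep   // Heckman's two-step alternative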
CHAPTER 14
Quantile regression
14.1. Introduction
Define the conditional c.d.f. of $Y$: $F(y|\mathbf{x}) = \Pr(Y \leq y|\mathbf{x})$. Instead of $E(y|\mathbf{x})$, as in the CRM, we model quantiles of $F(y|\mathbf{x})$.
The conditional median function of $y$, $Q_{0.5}(y|\mathbf{x})$, is an example. Specifically, for given $\mathbf{x}$ and $F(y|\mathbf{x})$, $Q_{0.5}(y|\mathbf{x})$ is the function that assigns the median of $F(y|\mathbf{x})$ to $\mathbf{x}$, and is implicitly defined by
$$F\left[Q_{0.5}(y|\mathbf{x})\,|\,\mathbf{x}\right] = 0.5$$
or, explicitly,
$$(14.1.1)\qquad Q_{0.5}(y|\mathbf{x}) = F^{-1}(0.5|\mathbf{x}).$$
More generally, given the quantile $q \in (0,1)$, define the conditional quantile function, $Q_q(y|\mathbf{x})$, as
$$(14.1.2)\qquad Q_q(y|\mathbf{x}) = F^{-1}(q|\mathbf{x}).$$
14.2. Properties of conditional quantiles
Define a predictor function $\hat{y}(\mathbf{x})$ and let $\varepsilon(y,\mathbf{x})$ be the corresponding prediction error, $\varepsilon(y,\mathbf{x}) = y - \hat{y}(\mathbf{x})$; then
• $L_q = q E_{\varepsilon(y,\mathbf{x})\geq 0}\left[\varepsilon(y,\mathbf{x})\right] + (q-1) E_{\varepsilon(y,\mathbf{x})<0}\left[\varepsilon(y,\mathbf{x})\right]$ is minimized when $\hat{y}(\mathbf{x}) = Q_q(y|\mathbf{x})$.
• $Q_q(y|\mathbf{x})$ is equivariant to monotone transformations: let $h(\cdot)$ be a monotonic function, then $Q_q\left[h(y)|\mathbf{x}\right] = h\left[Q_q(y|\mathbf{x})\right]$.
In the case of the median:
$$L_{0.5} = E_{y,\mathbf{x}}\left(|\varepsilon(y,\mathbf{x})|\right) \Longrightarrow Q_{0.5}(y|\mathbf{x}) \text{ minimizes } E_{y,\mathbf{x}}\left(|\varepsilon(y,\mathbf{x})|\right).$$
In other words, $Q_{0.5}(y|\mathbf{x})$ is the minimum mean absolute error predictor.
14.3. Estimation
There is a sample $\{y_i,\mathbf{x}_i\}$, $i = 1, \ldots, n$, for estimation.
14.3.1. Marginal effects. We can put the QR model in close relationship with the CRM. Let
$$y_i = E(y_i|\mathbf{x}_i) + u_i$$
where $u_i = y_i - E(y_i|\mathbf{x}_i)$; then
$$Q_q(y_i|\mathbf{x}_i) = E(y_i|\mathbf{x}_i) + Q_q(u_i|\mathbf{x}_i).$$
This is proved by invoking the equivariance property of $Q_q(\cdot|\mathbf{x})$, since conditional on $\mathbf{x}_i$ the shift by the constant $E(y_i|\mathbf{x}_i)$ is a monotone transformation:
$$Q_q(y_i|\mathbf{x}_i) = Q_q\left[E(y_i|\mathbf{x}_i) + u_i|\mathbf{x}_i\right] = E(y_i|\mathbf{x}_i) + Q_q(u_i|\mathbf{x}_i).$$
14.3.1.1. i.i.d. case. If $\{u_i\}$, $i = 1, \ldots, n$, are i.i.d. and independent of $\mathbf{x}_i$, then $Q_q(u_i|\mathbf{x}_i)$ is constant over the sample and varies only with $q$: $Q_q(u_i|\mathbf{x}_i) = \delta_q$, so that
$$Q_q(y_i|\mathbf{x}_i) = E(y_i|\mathbf{x}_i) + \delta_q,$$
which implies that here the marginal effects computed from the regression model coincide with those computed from the QR. Therefore
$$\partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \partial_{\mathbf{x}} E(y_i|\mathbf{x}_i)$$
and
$$E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta.$$
14.3.1.2. General case. If the i.i.d. assumption does not hold (e.g. because of heteroskedasticity), then
$$\partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \partial_{\mathbf{x}} E(y_i|\mathbf{x}_i) + \partial_{\mathbf{x}} Q_q(u_i|\mathbf{x}_i)$$
and
$$E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta + \partial_{\mathbf{x}} Q_q(u_i|\mathbf{x}_i).$$
Also,
$$(14.3.1)\qquad E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \ \text{ and } \ Q_q(u_i|\mathbf{x}_i) = \mathbf{x}_i'\gamma_q \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta + \gamma_q.$$
14.3.2. The linear QR. The linear quantile regression model specifies $Q_q(y_i|\mathbf{x}_i)$ as a linear function
$$Q_q(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta_q$$
or, equivalently,
$$y_i = \mathbf{x}_i'\beta_q + u_{q,i}$$
where $u_{q,i} = y_i - \mathbf{x}_i'\beta_q$ and, from (14.3.1), $\beta_q = \beta + \gamma_q$. A consistent estimator, $b_q$, for $\beta_q$ is found by the analogy principle:
$$b_q = \arg\min_b \left\{ q\sum_{y_i \geq \mathbf{x}_i'b}\left(y_i - \mathbf{x}_i'b\right) + (q-1)\sum_{y_i < \mathbf{x}_i'b}\left(y_i - \mathbf{x}_i'b\right)\right\}.$$
Under mild regularity conditions
$$b_q \overset{a}{\sim} N\left(\beta_q,\ A^{-1}BA^{-1}\right)$$
where $A = \sum_i f_{u_q}(0|\mathbf{x}_i)\,\mathbf{x}_i\mathbf{x}_i'$, $B = q(1-q)\sum_i \mathbf{x}_i\mathbf{x}_i'$ and $f_{u_q}(0|\mathbf{x}_i)$ is the conditional density of $u_{q,i}$ evaluated at $u_{q,i} = 0$. The presence of this density makes $A^{-1}BA^{-1}$ difficult to estimate; it is better to apply conventional bootstrap procedures.
The main Stata command implementing QR is qreg. Its syntax is similar to that of regress. The quantile(.##) option of qreg indicates the quantile of choice (e.g. to get the median, which is also the default, set .##=.50). To produce QR estimates with bootstrap standard errors apply bsqreg. The reps(#) option of bsqreg indicates the number of bootstrap replications.
Implementing the same model at various quantiles through repeated qreg regressions can shed light on discrepancies in behavior across different regions of the distribution of the variable of interest. To evaluate the statistical significance of such discrepancies, though, it is necessary to estimate a larger covariance matrix encompassing the covariances between coefficient estimators across quantiles. This is done by sqreg, which provides simultaneous estimates of the quantile regressions chosen by the researcher.
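A minimal sketch of these commands with hypothetical variables y, x1, x2 (the cross-quantile test assumes sqreg labels the equations q25 and q75, as I recall from its help file; option names should be double-checked there):

* hypothetical variables: y, x1, x2
qreg y x1 x2                                    // median regression (default quantile .50)
qreg y x1 x2, quantile(.25)                     // first-quartile regression
bsqreg y x1 x2, quantile(.25) reps(400)         // same, with bootstrap standard errors
sqreg y x1 x2, quantiles(.25 .5 .75) reps(400)  // simultaneous estimation across quantiles
test [q25]x1 = [q75]x1                          // equality of the x1 coefficient across quantiles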
14.4. A heteroskedastic regression model with simulated data
This is based on ch. 7.4 in Cameron and Trivedi (2009) and clarifies a couple of somewhat
intricate points therein.
Generate the data from a heteroskedastic linear regression model:
$$y = 1 + x_2 + x_3 + u,$$
$$u = (0.1 + 0.5x_2)\,\varepsilon,$$
$$x_2 \sim \chi^2(1),$$
$$x_3|x_2 \sim N(0, 25),$$
$$\varepsilon|x_2, x_3 \sim N(0, 25).$$
It is not hard to verify that $E(y|\mathbf{x}) = 1 + x_2 + x_3$. Therefore,
$$\partial_{x_2} E(y|\mathbf{x}) = 1$$
$$\partial_{x_3} E(y|\mathbf{x}) = 1$$
and we expect that an OLS regression will yield coefficient estimates close to the foregoing marginal effects. Also,
$$Q_q(u|\mathbf{x}) = Q_q\left[(0.1 + 0.5x_2)\,\varepsilon|\mathbf{x}\right] = (0.1 + 0.5x_2)\,Q_q(\varepsilon|\mathbf{x}) = 0.1\delta_q + 0.5\delta_q x_2,$$
where the first equality is obvious, the second follows from the equivariance property (multiplication by $0.1 + 0.5x_2 > 0$ is a monotone increasing transformation, since $x_2 \geq 0$) and the last from the independence of $\varepsilon$ and $x_2$, yielding $\delta_q = Q_q(\varepsilon|\mathbf{x})$. Then, according to (14.3.1), we have
$$\partial_{x_2} Q_q(y|\mathbf{x}) = 1 + 0.5\delta_q$$
$$\partial_{x_3} Q_q(y|\mathbf{x}) = 1.$$
Therefore, we expect that the quantile regressions will yield coefficient estimates for $x_3$ close to the OLS estimate, regardless of the quantile considered, whilst for $x_2$ we will observe various discrepancies from the OLS estimate depending on the quantile. This is confirmed by the outcome of the do-file qr.do.
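The do-file qr.do is not reproduced in these notes; the following is a minimal sketch of a simulation along the same lines (sample size and seed are arbitrary choices):

* simulate the heteroskedastic DGP and compare OLS with quantile regressions
clear
set seed 10101
set obs 10000
generate x2 = rchi2(1)
generate x3 = 5*rnormal()
generate e  = 5*rnormal()
generate y  = 1 + x2 + x3 + (0.1 + 0.5*x2)*e
regress y x2 x3                 // both slopes close to 1
qreg y x2 x3, quantile(.25)     // x2 slope shifted by 0.5*Q_.25(e)
qreg y x2 x3, quantile(.75)     // x2 slope shifted by 0.5*Q_.75(e)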
Bibliography
Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms.
Econometrica 67, 251–333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel
data. Journal of Econometrics 18, 570–606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional wald tests
in iv regression with weak instruments. Journal of Econometrics 139, 116–132.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford
Bulletin of Economics and Statistics 49 (4), 431–34.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte carlo evidence
and an application to employment equations. Review of Economic Studies 58, 277–297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data
models. Journal of Econometrics 87, 115–143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models.
Economics Letters 77, 211–220.
Bruno, G. S. F., 2005a. Approximating the bias of the lsdv estimator for dynamic unbalanced
panel data models. Economics Letters 87, 361–366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473–500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in
asymptotic expansions of bias. Economics Letters 79, 145–152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering.
Journal of Business & Economic Statistics 29, 238–249.
Cameron, A. C., Trivedi, P. K., 2009. Microeconometrics using Stata. Stata Press, College
Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum
likelihood. The Stata Journal 3, 278–294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222–240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence
under juvenile and adult criminal law. European Journal of Political Economy 28, 414–429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do catholic
schools make a difference? The Quarterly Journal of Economics 110, 941–974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics
30, 1164–1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal
of Economic Education 29, 291–300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice
Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice
Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators.
Econometrica 50 (4), 1029–1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects.
Econometrica 49, 1377–1398.
Hayashi, F., 2000. Econometrics. Princeton University Press.
Judson, R. A., Owen, A. L., 1999. Estimating dynamic panel data models: a guide for macroe-
conomists. Economics Letters 65, 9–15.
Kiviet, J. F., 1995. On bias, inconsistency and efficiency of various estimators in dynamic
panel data models. Journal of Econometrics 68, 53–78.
Kiviet, J. F., 1999. Expectation of expansions for estimators in a dynamic panel data model;
some results for weakly exogenous regressors. In: Hsiao, C., Lahiri, K., Lee, L.-F., Pesaran,
M. H. (Eds.), Analysis of Panels and Limited Dependent Variable Models. Cambridge Uni-
versity Press, Cambridge, pp. 199–225.
Maddala, G. S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cam-
bridge University Press, Cambridge.
Mikusheva, A., Poi, B. P., 2006. Tests and confidence sets with correct size when instruments
are potentially weak. The Stata Journal 6, 335–347.
Moulton, B. R., 1990. An illustration of a pitfall in estimating the effects of aggregate variables
on micro units. The Review of Economics and Statistics 72 (2), 334–38.
Mundlak, Y., 1978. On the pooling of time series and cross section data. Econometrica 46,
69–85.
Nickell, S. J., 1981. Biases in dynamic models with fixed effects. Econometrica 49, 1417–1426.
Prucha, I. R., 1984. On the asymptotic efficiency of feasible aitken estimators for seemingly
unrelated regression models with error components. Econometrica 52, 203–207.
Rao, C. R., 1973. Linear Statistical Inference and Its Applications. New York: Wiley.
Roodman, D. M., 2009. How to do xtabond2: An introduction to difference and system gmm
in stata. The Stata Journal 9 (1), 86–136.
Roodman, D. M., 2011. Fitting fully observed recursive mixed-process models with cmp. The
Stata Journal 11, 159–206.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects
panel data regression. Econometrica 76, 155–74.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of
coefficients in the error components regression models. Econometrica 40 (2), 261–275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step
gmm estimators. Journal of Econometrics 126, 25–51.
Wooldridge, J. M., 2005. Unobserved heterogeneity and estimation of average partial effects.
In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification And Inference For Econometric
Models: Essays In Honor Of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition.
The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics
and Statistics 67, 134–139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple
least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092–1109.