Econometric Models: Lecture Notes for 2014/15
ESS/DES Econometrics (revised February 2015)
Giovanni Bruno
Department of Economics, Bocconi University, Milano
E-mail address: [email protected]
Contents
Part 1. Linear Models 7
Chapter 1. Introduction 8
1.1. Introduction 8
1.2. The linear population model 10
Chapter 2. The linear regression model 13
2.1. From the linear population model to the linear regression model 13
2.2. The properties of the LRM 14
2.3. Difficulties 15
Chapter 3. The Algebraic properties of OLS 18
3.1. Motivation, notation, conventions and main assumptions 18
3.2. Linear combinations of vectors 19
3.3. OLS: definition and properties 19
3.4. Spanning sets and orthogonal projections 27
3.5. OLS residuals and fitted values 30
3.6. Partitioned regression 31
3.7. Goodness of fit and the analysis of variance 37
3.8. Centered and uncentered goodness-of-fit measures 39
Chapter 4. The finite-sample statistical properties of OLS 43
4.1. Introduction 43
4.2. Unbiasedness 43
4.3. The Gauss-Markov Theorem 44
4.4. Estimating the covariance matrix of OLS 47
4.5. Exact tests of significance with normally distributed errors 48
4.6. The general law of iterated expectation 58
4.7. The omitted variable bias 59
4.8. The variance of an OLS individual coefficient 65
4.9. Residuals from partitioned OLS regressions 70
Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity 71
5.1. Introduction 71
5.2. Embedding the Oaxaca model into a pooled regression framework 71
5.3. The OLS estimator in the Oaxaca model is BLUE 75
5.4. A general result 77
Chapter 6. Tests for structural change 81
6.1. The Chow predictive test 81
6.2. An equivalent reformulation of the Chow predictive test 85
6.3. The classical Chow test 87
Chapter 7. Large sample results for OLS and GLS estimators 90
7.1. Introduction 90
7.2. OLS with non-spherical error covariance matrix 91
7.3. GLS 96
7.4. Large sample tests 103
Chapter 8. Fixed and Random Effects Panel Data Models 108
8.1. Introduction 108
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model) 108
8.3. The Random Effect Model 117
8.4. Stata implementation of standard panel data estimators 123
8.5. Testing fixed effects against random effects models 125
8.6. Large-sample results for the LSDV estimator 129
8.7. A Robust covariance estimator 132
8.8. Unbalanced panels 134
Chapter 9. Robust inference with cluster sampling 136
9.1. Introduction 136
9.2. Two-way clustering 137
9.3. Stata implementation 142
Chapter 10. Issues in linear IV and GMM estimation 144
10.1. Introduction 144
10.2. The method of moments 146
10.3. Stata implementation of the TSLS estimator 150
10.4. Stata implementation of the (linear) GMM estimator 151
10.5. Robust Variance Estimators 152
10.6. Durbin-Wu-Hausman Exogeneity test 152
10.7. Endogenous binary variables 155
10.8. Testing for weak instruments 156
10.9. Inference with weak instruments 156
10.10. Dynamic panel data 157
Part 2. Non-linear models 169
Chapter 11. Non-linear regression models 170
11.1. Introduction 170
11.2. Non-linear least squares 170
11.3. Poisson model for count data 171
11.4. Modelling and testing overdispersion 174
Chapter 12. Binary dependent variable models 176
12.1. Introduction 176
12.2. Binary models 177
12.3. Coefficient estimates and marginal effects 180
12.4. Tests and Goodness of fit measures 181
12.5. Endogenous regressors 182
12.6. Independent latent heterogeneity 182
12.7. Multivariate probit models 183
Chapter 13. Censored and selection models 187
13.1. Introduction 187
13.2. Tobit models 187
13.3. A simple selection model 190
Chapter 14. Quantile regression 192
14.1. Introduction 192
14.2. Properties of conditional quantiles 192
14.3. Estimation 193
14.4. A heteroskedastic regression model with simulated data 195
Bibliography 197
Part 1
Linear Models
CHAPTER 1
Introduction
1.1. Introduction
Causation is not the same thing as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,
(1.1.1) y = g(x, ε)
where g: R^(k+q) → R. Often, the structural relationship is formulated as a probabilistic model for a given population of interest. So y denotes a random scalar, x = (x1 ... xk)' ∈ 𝒳 ⊆ R^k is a k × 1 random vector of explanatory variables of interest and ε is a q × 1 random vector of latent variables. A structural model can be understood as one showing a causal relationship from the economic factors of interest, x, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, g(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; or it may be the demand function for some input in a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising technological latent heterogeneity, and so on.¹
The individual g(x, ε), with its gradient vector of marginal effects, ∂ₓ g(x, ε), and Hessian matrix, D_xx g(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function,
∫ g(x, ε) dF(ε),
the population-averaged marginal effects,
∫ ∂ₓ g(x, ε) dF(ε),
or the population-averaged Hessian matrix,
∫ D_xx g(x, ε) dF(ε).
Statistics supplements the probabilistic model with a sampling mechanism in order to estimate characteristics of the population from a sample of observables. The population objects of interest may be, for example, the joint probability distribution of the observables, F(y, x), and its moments, E(y), E(x), E(xx'), E(xy); the conditional distribution of y given x, F(y|x), and its moments, E(y|x) and Var(y|x); or functions of the above.
¹ Wooldridge (2010) prefers to think of g(x, ε) as a structural conditional expectation: E(y|x, ε) ≡ g(x, ε). There is nothing in the present analysis that prevents such an interpretation.
The key question is under what conditions these estimable statistical objects are informative about g(x, ε). Evidently, to establish a mapping between the structural economic objects of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in g(x, ε), and to do so in a plausible way. The restrictions used to this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.
1.2. The linear population model
Equation (1.1.1) is a linear model of the population if the following assumptions hold.
P.1: Linearity: g(x, ε) = x'β + ε, with ε being a random scalar (q = 1) and β a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k or, equivalently, det[E(xx')] ≠ 0.
P.3: Conditional-mean independence of ε and x: E(ε|x) = 0.
Under linearity (P.1), equation (1.1.1) becomes
(1.2.1) y = x'β + ε,
and then, given P.3, E(y|x) = x'β.
An equivalent, but easier to interpret, formulation of assumption P.2 is:
P.2b: No element of x in 𝒳 can be obtained as a linear combination of the others with probability equal to one: Pr(a'x = 0) = 1 only if a = 0.
The following proves the equivalence of P.2 and P.2b (not crucial and rather technical, it can be skipped for the exam). I exploit the properties of the expectation and rank operators. Assume P.2 and Pr(a'x = 0) = 1 for some conformable constant vector a. Then E(a'xx'a) = 0, and so a'E(xx')a = 0, which implies a = 0 by P.2, proving P.2b. Now, assume P.2b and pick any a ≠ 0. Then Pr(a'x = 0) ≠ 1 and so Pr(a'x ≠ 0) > 0. But since a'x ≠ 0 is equivalent to a'xx'a > 0, then Pr(a'xx'a > 0) = Pr(a'x ≠ 0) > 0. So, since Pr(a'xx'a ≥ 0) = 1, E(a'xx'a) > 0, which in turn implies a'E(xx')a > 0. Therefore, E(xx') is positive definite and so non-singular, which is P.2.
Exercise 1. Prove that if x = (1 x1)' then assumption P.2 is equivalent to Var(x1) ≠ 0.
Solution:
E(xx') =  ( 1        E(x1)  )
          ( E(x1)    E(x1²) )
and so det E(xx') = E(x1²) − E²(x1) = Var(x1), and the claim is proved by noting that for any k × k matrix A, rank(A) = k if and only if det(A) ≠ 0.
By equation (1.1.1) and assumption P.1, the latent part of g(x, ε), ε, satisfies the equation
ε = y − x'β
and the marginal effects, ∂ₓ g(x, ε), satisfy
∂ₓ g(x, ε) = β.
By assumption P.3 and the law of iterated expectations, E(xε) = 0. Since ε = y − x'β, we have the system of k moment conditions
(1.2.2) E(xy − xx'β) = 0
or E(xy) = E(xx')β. Assumption P.2, then, ensures that the foregoing system can be solved for β to give
(1.2.3) β = [E(xx')]⁻¹ E(xy).
At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, g(x, ε), ε and ∂ₓ g(x, ε), and, on the other, the observable or estimable objects y, x, E(xx') and E(xy). Indeed, g(x, ε), ε and ∂ₓ g(x, ε) are equal to unique known transformations of y, x, E(xx') and E(xy). This means that g(x, ε), ε and ∂ₓ g(x, ε) can be estimated using estimators for E(xx') and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares (OLS) estimator and its various generalizations.
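As a rough illustration of the mapping β = [E(xx')]⁻¹E(xy) — not part of the course do-files, with invented variable names, parameter values and seed — the following Stata sketch simulates a linear population and checks that the sample analogue of the moment formula coincides with OLS:

* Sketch: sample analogue of beta = [E(xx')]^(-1) E(xy) versus regress
clear
set seed 12345
set obs 10000
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 0.5*x1 - 2*x2 + rnormal()     // true beta = (0.5, -2, 1)'
matrix accum XX = x1 x2                         // X'X (a constant is appended last)
matrix vecaccum Xy = y x1 x2                    // y'X (a row vector, constant last)
matrix b_mom = invsym(XX)*Xy'                   // (X'X)^(-1) X'y
matrix list b_mom
regress y x1 x2                                 // reproduces b_mom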
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality:
P.4: Var(ε|x) = σ².
P.5: ε|x ~ N(0, σ²).
A more general variant of the linear model, instead, replaces assumption P.3 with
P.3b: E(xε) = 0.
Under P.3b it is still true that β = [E(xx')]⁻¹E(xy) and ∂ₓ g(x, ε) = β, with the virtue that the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general.
The function x'β, with β = [E(xx')]⁻¹E(xy), is relevant in either version of the linear model and is called the linear projection of y onto x.
Exercise 2. Prove that if x contains 1, then E(xε) = 0 is equivalent to E(ε) = 0 and cov(ε, x) = 0 (hint: remember that cov(ε, x) = E(xε) − E(x)E(ε)).
Solution: Assume E(xε) = 0. Since 1 is an element of x, the first component of E(xε) is E(ε) = 0; then, given cov(ε, x) = E(xε) − E(x)E(ε), cov(ε, x) = 0. Conversely, assume E(ε) = 0 and cov(ε, x) = 0. Then E(xε) = E(x)E(ε) = 0.
CHAPTER 2
The linear regression model
The linear regression model is a statistical model: as such, it incorporates a probabilistic model of the population and a sampling mechanism that draws the data from the population.
2.1. From the linear population model to the linear regression model
Consider the linear model of the previous chapter: the population equation (1.1.1)
y = g(x, ε)
with the assumptions
P.1: Linearity: g(x, ε) = x'β + ε, with x = (x1 x2 ... xk)' being a k × 1 random vector, ε a random scalar and β = (β1 β2 ... βk)' a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k.
P.3: Conditional-mean independence of ε and x: E(ε|x) = 0.
Now, add the random sampling assumption:
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently and identically distributed (i.i.d.) random vectors.
Given P.1-P.3 and RS, we have the linear regression model (LRM)
(2.1.1) yi = xi'β + εi
with xi' = (xi1 xi2 ... xik), i = 1, ..., n, and {εi = yi − xi'β, i = 1, ..., n} a sequence of unobserved i.i.d. error terms.
2.2. The properties of the LRM
It is convenient to express the LRM in compact matrix form as
(2.2.1) y = Xβ + ε
where y is the n × 1 vector stacking the yi, X is the n × k matrix whose i-th row is xi', and ε is the n × 1 vector stacking the εi:
y = (y1, ..., yi, ..., yn)',   X = (x1, ..., xi, ..., xn)',   ε = (ε1, ..., εi, ..., εn)'.
It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties:
LRM.1: Linearity in the parameters.
LRM.2: X has full column rank, that is, rank(X) = k.
LRM.3: The variables in X are strictly exogenous, that is,
E(εi | x1, ..., xi, ..., xn) = 0,
i = 1, ..., n, or, more compactly, E(ε|X) = 0.
LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of the other columns of X or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi'a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi) = 0, i = 1, ..., n, or E(yi|xi) − xi'β = 0, i = 1, ..., n. Since
E(εi | x1, ..., xi, ..., xn) = E(yi | x1, ..., xi, ..., xn) − xi'β
and, by RS, E(yi | x1, ..., xi, ..., xn) = E(yi|xi), then
E(εi | x1, ..., xi, ..., xn) = E(yi|xi) − xi'β = 0.
If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one easily verifies that
LRM.4: Var(ε|X) = σ²I_n.
LRM.5: ε|X ~ N(0, σ²I_n).
While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by
LRM.3b: E(Σᵢ₌₁ⁿ xiεi) = 0 or, more compactly, E(X'ε) = 0.
2.3. Difficulties
Some or all of LRM.1-LRM.5 may not be verified if the population model assumptions
and/or the RS mechanism are not verified in reality. Here is a list of the most important
population issues.
• Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.
• Perfect multicollinearity (P.2 fails): some variables in x are exact linear combinations of the others. LRM.2 fails, but in general this is not a serious problem: it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.
• Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.
• Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
• Non-normality (P.5 fails): ε is not conditionally normal. LRM.5 fails.
Other important problems are instead with the RS assumption.
• Omitted variables: some of the variables in x are not sampled. This implies that the missing variables cannot enter the conditioning set and have to be treated as unobserved errors, along with ε, which could make LRM.3-LRM.5 fail.
• Measurement error: some of the variables in x are measured with error. We have the wrong variables in the conditioning set. As in the case of omitted variables, LRM.3-LRM.5 may fail.
• Endogenous selection: some units in the sample are missing due to events related to ε. Also in this case, LRM.3-LRM.5 are likely to fail.
Notice that problems in the RS mechanism often have their roots in the population model. For example, the presence of non-random variables in x is not in general compatible with an identically distributed sample and, in consequence, with RS. It is easy to verify, though, that non-random x together with a weaker sampling mechanism requiring only independent sampling is compatible with LRM.1-LRM.5. Also, the presence of variables in x at different levels of aggregation may not be compatible with independent sampling, as observed by Moulton (1990). In this case, the sampling mechanism can be relaxed by maintaining independence only across groups of observations and not across the observations themselves. See for example the sampling mechanism described in Section 8.6 for panel data models, in which the sample is neither identically distributed nor independent across observations.
Finally, it is important to emphasize that even if all the population assumptions and the
RS mechanism are valid, data problems may arise in the form of multicollinearity among
regressors.
• Multicollinearity: some of the variables in X are almost collinear. In the population this is reflected by det[E(xx')] ≃ 0 and in the sample by det(X'X) ≃ 0.
As we will see in Chapter 4, although multicollinearity does not affect the statistical properties
of the estimators in finite samples, it can severely affect the precision of the coefficient estimates
in terms of large standard errors.
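The following Stata sketch — not from the course material; data, variable names and parameter values are invented for illustration — shows the loss of precision under near-collinearity:

* Sketch: near-collinearity inflates standard errors
clear
set seed 101
set obs 200
generate x1 = rnormal()
generate x2 = x1 + 0.01*rnormal()       // x2 is almost collinear with x1
generate x3 = rnormal()                 // an unrelated comparison regressor
generate y  = 1 + x1 + x3 + rnormal()
regress y x1 x3                         // precise estimates
regress y x1 x2 x3                      // the standard errors on x1 and x2 explode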
CHAPTER 3
The Algebraic properties of OLS
3.1. Motivation, notation, conventions and main assumptions
We do not agree with Larry (the adult croc), do we? Algebra may be boring, but only if its purpose is left obscure. Algebra in econometrics provides the bricks with which estimators and tests are constructed. The fact that most estimators and tests are automatically implemented by statistical packages is no excuse to neglect the underlying algebra. First, because most does not mean all, and it may be the case that for our own research we have to build the technique ourselves. This is especially true for the most recent techniques: a robust Hausman test for panel data models and multiway cluster-robust standard errors are just two examples of techniques that are not yet coded in the popular statistical packages. Second, even if the technique is available as a built-in procedure in our preferred statistical package, to use it correctly we have to know how it is made, which boils down to understanding its algebra. Finally, the interpretation of results often requires that we are aware of the algebraic properties of estimators and tests. So the material here may seem rather intricate at times, but it is certainly of practical use.
This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real n × 1 vector containing the observations of the dependent variable; X is a real (n × k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data
set US_gasoline.dta.
3.2. Linear combinations of vectors
Given the real (n × k) matrix A, the columns of A are said to be linearly dependent if there exists some non-zero (k × 1) vector b such that Ab = 0.
Given the real (n × k) matrix A, the columns of A are said to be linearly independent if Ab = 0 only if b = 0.
Two real non-zero (n × 1) vectors a and b are said to be orthogonal if a'b = 0. Given two real non-zero (n × k) matrices A and B, if each column of A is orthogonal to all columns of B, so that A'B = 0, then A and B are said to be orthogonal.
3.3. OLS: definition and properties
We do not have any model in mind here, just data for the response variable, the n × 1 vector y = (y1, ..., yi, ..., yn)', and the n × k regressor matrix X whose i-th row is xi' = (xi1 xi2 ... xik). I only maintain that rank(X) = k.
We aim at finding an optimal approximation of y using the information contained in y and X. One such approximation can be obtained through the ordinary least squares (OLS) estimator, b, defined as the minimizer of the residual sum of squares S(b₀):
b = argmin_{b₀} S(b₀),
where
S(b₀) = (y − Xb₀)'(y − Xb₀).
Geometrically, Xb is an optimal approximation of y in that it minimizes the Euclidean distance from the vector y to the hyperplane Xb₀. As such, b satisfies
∂S(b₀)/∂b₀ |_{b₀ = b} = 0.
Expanding (y − Xb₀)'(y − Xb₀):
S(b₀) = y'y − b₀'X'y − y'Xb₀ + b₀'X'Xb₀ = y'y − 2y'Xb₀ + b₀'X'Xb₀.
Then, taking the partial derivatives,
∂S(b₀)/∂b₀ = −2X'y + 2X'Xb₀,
so that the first order conditions (OLS normal equations) of the minimization problem are
(3.3.1) −X'y + X'Xb = 0,
with the resulting formula for the OLS estimator
(3.3.2) b = (X'X)⁻¹X'y.
Notice that:
• The estimator exists since X'X is non-singular, X being of full column rank.
• The estimator is a true minimizer since the Hessian of S(b₀),
∂²S(b₀)/∂b₀∂b₀' = 2X'X,
is a positive definite matrix (i.e. S(b₀) is globally convex in b₀). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c'Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption, z = Xc ≠ 0 for any c ≠ 0, therefore c'X'Xc = z'z = Σᵢ₌₁ⁿ zᵢ² > 0 for any c ≠ 0.
The OLS residuals are defined as
(3.3.3) e = y − Xb.
By (3.3.1) it follows that e and X are orthogonal:
(3.3.4) X'(y − Xb) = 0.
Therefore, if X contains a column of all unity elements, say 1, three important implications follow.
(1) The sample mean of e is zero: 1'e = Σᵢ₌₁ⁿ eᵢ = 0 and, consequently, ē = (1/n) Σᵢ₌₁ⁿ eᵢ = 0.
(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is, ȳ = x̄'b, where ȳ = (Σᵢ₌₁ⁿ yᵢ)/n and x̄' = ((1/n) Σᵢ₌₁ⁿ xᵢ₁, ..., (1/n) Σᵢ₌₁ⁿ xᵢₖ) (this follows straightforwardly from 1'e = 1'(y − Xb) = 0).
(3) Let
(3.3.5) ŷ = Xb
denote the OLS predicted values of y. Since the sample mean of Xb equals x̄'b, the sample mean of ŷ equals ȳ.
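As a quick sketch — not part of the course do-files; depvar, x1 and x2 are placeholder variable names — implications (1)-(3) can be checked numerically on any fitted regression with a constant:

* Sketch: numerical check of the algebraic implications of the normal equations
quietly regress depvar x1 x2
generate double yhat = _b[_cons] + _b[x1]*x1 + _b[x2]*x2   // Xb computed by hand
generate double ehat = depvar - yhat                       // OLS residuals
summarize ehat                        // (1) the sample mean of e is zero
summarize depvar yhat                 // (3) fitted values have the same mean as y
quietly summarize x1
scalar x1bar = r(mean)
quietly summarize x2
scalar x2bar = r(mean)
quietly summarize depvar
display "ybar = " r(mean) "   xbar'b = " _b[_cons] + _b[x1]*x1bar + _b[x2]*x2bar   // (2)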
3.3.1. Stata implementation: get your Stata data file with use. All Stata data
files can be recognized by their filetype dta. Suppose you have y and X within a Stata data
file called, say, mydata.dta, stored in your Stata working directory and that you have just
launched Stata on your laptop. To get your data into memory, from the Stata command line
execute use followed by the name of the data file (specifying the filetype dta is not necessary
since use only supports dta files):
use mydata
If mydata.dta is not in your Stata working directory but somewhere else on your laptop, then you must specify the path of the dta file. For example, if you have a Mac and your data file is in the folder /Users/giovanni you will write
use /Users/giovanni/mydata
If you have a PC and your file is in c:\giovanni:
use c:\giovanni\mydata
If the path involves folders with names that include blanks, then include the whole path
into double quotes. For example:
use "/Users/giovanni/didattica/Greene/dati/ch. 1/mydata"
3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,
help use
will open a new window describing use:
[Screenshot of the Stata help window for use; its recoverable content, in readable order, is the following.]
Title:       [D] use — Load Stata-format dataset
Syntax:      use filename [, clear nolabel]
             use [varlist] [if] [in] using filename [, clear nolabel]
Menu:        File > Open...
Description: use loads a Stata-format dataset previously saved by save into memory. If filename is specified without an extension, .dta is assumed. If your filename contains embedded spaces, remember to enclose it in double quotes. In the second syntax for use, a subset of the data may be read.
Options:     clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk. nolabel prevents value labels in the saved data from being loaded.
Examples:    . use http://www.stata-press.com/data/r11/auto
             . use make rep78 foreign using myauto
             . use if foreign == 0 using myauto
3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in example 1.2 of Greene (2008) with annual
values of US aggregate consumption (c) used as the dependent variable and regressed on
annual values of US personal income (y) for the period 1970-1979.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"

. regress c y

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

------------------------------------------------------------------------------
           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000     .9063809    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
------------------------------------------------------------------------------
regress includes the constant term (the unity vector) by default and always with the
name _cons. If you don’t want it, just add the regress option, noconstant:
regress depvar x1 x2, noconstant
Notice that, as a general rule of Stata syntax, the options of any Stata command always follow the comma. This means that if you wish to specify options you have to write the comma after the last argument of the command, so that everything to the right of the comma is treated by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma at all.
After execution, regress leaves behind a number of objects in memory, mainly scalars
and matrices, that will stay there, available for use, until execution of the next estimation
command. To know what these objects are, consult the section Saved results in the help
of regress, where you will find the following description.
Saved results
regress saves the following in e():
Scalars:   e(N) number of observations; e(mss) model sum of squares; e(df_m) model degrees of freedom; e(rss) residual sum of squares; e(df_r) residual degrees of freedom; e(r2) R-squared; e(r2_a) adjusted R-squared; e(F) F statistic; e(rmse) root mean squared error; e(ll) log likelihood under the additional assumption of i.i.d. normal errors; e(ll_0) log likelihood, constant-only model; e(N_clust) number of clusters; e(rank) rank of e(V).
Macros:    e(cmd) regress; e(cmdline) command as typed; e(depvar) name of dependent variable; e(model) ols or iv; e(wtype) weight type; e(wexp) weight expression; e(title) title in estimation output when vce() is not ols; e(clustvar) name of cluster variable; e(vce) vcetype specified in vce(); e(vcetype) title used to label Std. Err.; e(properties) b V; e(estat_cmd) program used to implement estat; e(predict) program used to implement predict; e(marginsok) predictions allowed by margins; e(asbalanced), e(asobserved) factor-variable settings.
Matrices:  e(b) coefficient vector; e(V) variance-covariance matrix of the estimators; e(V_modelbased) model-based variance.
Functions: e(sample) marks estimation sample.
You should already be familiar with some of the e() objects in the Scalars and Matrices parts. By the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts: they are rather technical and, in any case, not relevant for our purposes.
To know the values taken on by the e() objects, execute the command ereturn list just
after the regress instruction. In our regression example we have:
. ereturn list

scalars:
               e(N) =  10
            e(df_m) =  1
            e(df_r) =  8
               e(F) =  959.919036180133
              e(r2) =  .9917348458900325
            e(rmse) =  8.193020017500434
             e(mss) =  64435.11918375102
             e(rss) =  537.0046160573024
            e(r2_a) =  .9907017016262866
              e(ll) =  -34.10649331948547
            e(ll_0) =  -58.08502782843004
            e(rank) =  2

macros:
         e(cmdline) : "regress c y"
           e(title) : "Linear regression"
       e(marginsok) : "XB default"
             e(vce) : "ols"
          e(depvar) : "c"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"

matrices:
               e(b) :  1 x 2
               e(V) :  2 x 2

functions:
           e(sample)
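A small sketch of how the saved results can be reused right after regress c y (the lines below only use e() objects and the _b/_se system variables that regress leaves in memory; they are illustrative, not part of the notes' do-files):

* Sketch: recycling regress saved results
matrix list e(b)                  // the OLS coefficient vector
matrix list e(V)                  // its estimated covariance matrix
display e(r2)                     // the R-squared shown in the output header
display e(rss)                    // the residual sum of squares (Residual SS)
display _b[y]/_se[y]              // reproduces the t statistic reported for y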
3.4. Spanning sets and orthogonal projections
Consider the n-dimensional Euclidean space Rⁿ and the (n × k) real matrix A. Then each column of A belongs to Rⁿ, and the set of all linear combinations of the columns of A is called the space spanned by the columns of A (or also the range of A), denoted by R(A).
R(A) can easily be proved to be a subspace of Rⁿ (it is obvious that R(A) ⊆ Rⁿ; R(A) is a vector space since, given any two vectors a1 and a2 belonging to R(A), a1 + a2 ∈ R(A) and ca1 ∈ R(A) for any real scalar c). Since each element of R(A) is a vector of n components, R(A) is said to be a vector space of order n. The dimension of R(A), denoted by dim[R(A)], is the maximum number of linearly independent vectors in R(A). Therefore, dim[R(A)] = rank(A) and, if A is of full column rank, dim[R(A)] = k.
The set of all vectors in Rⁿ that are orthogonal to the vectors of R(A) is denoted by A⊥. I now prove that A⊥ is a subspace of Rⁿ. A⊥ ⊆ Rⁿ by definition. Given any two vectors b1 and b2 belonging to A⊥ and any a ∈ R(A), b1'a = 0 and b2'a = 0, but then also (b1 + b2)'a = 0 and, for any scalar c, (cb1)'a = 0, which completes the proof.
Importantly, it is possible to prove, although this is not pursued here, that
(3.4.1) dim(A⊥) = n − rank(A).
A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).
Exercise 3. Prove that any subspace of Rn contains the null vector.
For simplicity, assume A of full column rank and define the operator P[A] as
P[A] = A(A'A)⁻¹A'.
As an exercise you can verify that P[A] is a symmetric (P[A]' = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is called an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,
M[A] = I − P[A]
is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is
symmetric and idempotent).
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n × k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k × k) matrix).
Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k × k). Therefore, P[A] = BK(K'B'BK)⁻¹K'B'. An important result of linear algebra states that, given two conformable matrices C and D, rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64)),
P[A] = BK(K'B'BK)⁻¹K'B' = BKK⁻¹(B'B)⁻¹(K')⁻¹K'B' = P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is P[A]M[A] = 0.
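As a numerical check (not a proof), the projector properties can be verified in Stata's matrix language on a small example; the 4 × 2 matrix A below is an arbitrary full-column-rank matrix chosen for illustration:

* Sketch: P[A] symmetric and idempotent, and P[A]M[A] = 0, on a toy matrix
matrix A = (1, 0 \ 1, 1 \ 1, 2 \ 1, 5)
matrix P = A*invsym(A'*A)*A'            // P[A]
matrix M = I(4) - P                     // M[A]
matrix list P                           // symmetric
matrix PP = P*P - P
matrix list PP                          // zero (up to rounding): P is idempotent
matrix PM = P*M
matrix list PM                          // zero (up to rounding): P and M are orthogonal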
3.5. OLS residuals and fitted values
The foregoing results are useful to properly understand the properties of OLS. But before going on, do the following exercise.
Exercise 7. Given any (n × 1) real vector v lying in R(A), prove that P[A]v = v and M[A]v = 0 (hint: express v as v = Ac, where c is a real (k × 1) vector).
From exercise 7 it clearly follows that
(3.5.1) P[A]A = A
and
(3.5.2) M[A]A = 0.
Using the OLS formula in (3.3.2),
(3.5.3) e = M[X]y,
where
M[X] = I − X(X'X)⁻¹X'.
Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the "residual maker". From (3.3.2) and (3.3.5) it follows that
ŷ = P[X]y
and so the vector of OLS predicted (fitted) values, ŷ, is the orthogonal projection of y onto the space spanned by the regressors, R(X). Clearly, e'ŷ = 0 (see exercise 6 or also equation (3.3.4)), therefore the OLS method decomposes y into two orthogonal components:
(3.5.4) y = ŷ + e.
3.5.1. Stata implementation: an important post-estimation command, predict. In applications ŷ and e are useful for a number of purposes. They can be obtained as new variables in your Stata data set using the post-estimation command predict. The way predict works is simple. Imagine you have just executed your regress instruction. To have ŷ in your data as a variable called, say, y_hat, just write from the command line:
predict y_hat
You have thereby created a new variable with name y_hat that contains the ŷ values. Fitted values are the default calculation of predict; if you want residuals, just add the res option:
predict resid, res
and you have got a new variable in your data called resid that contains the e values.
It is important to stress that predict supports any estimation command, not only regress. So it can be used, for example, after xtreg in the context of panel data.
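A short sketch — placeholder variable names, not the course data — of how predict recovers the orthogonal decomposition y = ŷ + e of Section 3.5:

* Sketch: predict and the decomposition y = yhat + e
quietly regress depvar x1 x2
predict double yhat2                    // fitted values (the default statistic)
predict double ehat2, residuals         // residuals
generate double check = depvar - yhat2 - ehat2
summarize check                         // identically zero: y = yhat + e
correlate yhat2 ehat2, covariance       // sample covariance ~ 0: e is orthogonal to yhat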
3.6. Partitioned regression
Partition X = (X1 X2) and, accordingly, b = (b1', b2')'.
Exercise 8. Prove that if X is of full column rank, so are X1 and X2.
Exercise 9. Prove that if X is of full column rank, then M[X1]X2 is of f.c.r. Solution: I prove the result by contradiction and assume that P[X1]X2b = X2b for some vector b ≠ 0. Therefore X1a = X2b, where a = (X1'X1)⁻¹X1'X2b, or Xc = 0, where c' = (a' −b'), which leads to a contradiction since c ≠ 0 and X is of f.c.r.
Reformulate the normal equations (3.3.1) according to the partitioning:
( X1' )       ( X1'X1   X1'X2 ) ( b1 )
(     ) y  −  (               ) (    )  =  0
( X2' )       ( X2'X1   X2'X2 ) ( b2 )
or
(3.6.1) X1'y − X1'X1b1 − X1'X2b2 = 0
(3.6.2) X2'y − X2'X1b1 − X2'X2b2 = 0.
Solving the first set of equations for b1,
(3.6.3) b1 = (X1'X1)⁻¹X1'(y − X2b2),
which shows at once the important result that if X1 and X2 are orthogonal, then
b1 = (X1'X1)⁻¹X1'y
and
b2 = (X2'X2)⁻¹X2'y,
that is, b1 (b2) can be obtained from the reduced OLS regression of y on X1 (X2).
In general situations things are not so simple, but it is still possible to work out both components of b in closed form. Replace the right hand side of equation (3.6.3) into the second set of equations (3.6.2) to obtain
X2'y − X2'X1(X1'X1)⁻¹X1'(y − X2b2) − X2'X2b2 = 0
or equivalently, using the orthogonal projector notation P[X1] for X1(X1'X1)⁻¹X1',
X2'y − X2'P[X1](y − X2b2) − X2'X2b2 = 0.
Rearrange the foregoing equation as
X2'(I − P[X1])y + X2'(P[X1] − I)X2b2 = 0
so that eventually
b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y.
By symmetry,
b1 = (X1'M[X2]X1)⁻¹X1'M[X2]y.
By inspecting either formula it emerges that b2 (b1, respectively) can be obtained from a regression where the dependent variable is the residual vector obtained by regressing y on X1 (X2) and the regressors are the residuals obtained from the regressions of each column of X2 (X1) on X1 (X2). This important result is known in the econometric literature as the Frisch-Waugh-Lovell Theorem.
Since b exists, so do its components, which proves that X1'M[X2]X1 and X2'M[X1]X2 are non-singular when X is of full column rank. This result can also be verified by direct inspection, as suggested by the following exercise.
Exercise 10. Prove that X2'M[X1]X2 is positive definite if X is of f.c.r. (hint: use exercise 9 to prove that M[X1]X2 is of full column rank and then the fact that M[X1] is symmetric and idempotent). Solution: By exercise 9, M[X1]X2a ≠ 0 if a ≠ 0. Let z = M[X1]X2a; hence a'X2'M[X1]X2a = z'z is a sum of squares with at least one positive element. Therefore, a'X2'M[X1]X2a > 0.
Exercise 11. Partitioning X = (X1 1), where 1 is the (n × 1) vector of all unity elements, prove that M[1] = I − 1(1'1)⁻¹1' transforms all variables into deviations from their sample means, and hence that the OLS estimator b1 can be obtained by regressing y demeaned on X1 demeaned.
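A quick Stata sketch of the Frisch-Waugh-Lovell result just stated (placeholder variable names; here the constant and x1 together play the role of X1, and x2 plays the role of X2):

* Sketch: FWL — the coefficient on x2 survives partialling out x1
quietly regress depvar x1 x2
display _b[x2]                          // coefficient from the full regression
quietly regress depvar x1
predict double ytilde, residuals        // y purged of x1 (and the constant)
quietly regress x2 x1
predict double x2tilde, residuals       // x2 purged of x1 (and the constant)
regress ytilde x2tilde, noconstant      // same coefficient as above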
The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.
Lemma 12. Given the partitioning X = (X1 X2), the following representation of P[X] holds:
(3.6.4) P[X] = P[X1] + P[M[X1]X2],
with the two terms on the right hand side of (3.6.4) being orthogonal.¹
¹ Equation (3.6.4) can be proved directly, using the formula for the inverse of a 2 × 2 partitioned matrix, or indirectly, but more easily, by using the FWL Theorem and noticing that, for any non-null y and X = (X1 X2) of f.c.r., P[X]y = X1b1 + X2b2, where b1 = (X1'X1)⁻¹X1'(y − X2b2) and b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y. You just have to plug the right hand sides of b1 and b2 into the right hand side of P[X]y = X1b1 + X2b2, work through all the algebraic simplifications and eventually notice that the equation you end up with, P[X]y = (P[X1] + P[M[X1]X2])y, must hold for any non-null y, so that P[X] = P[X1] + P[M[X1]X2].
3.6.1. Add one regressor. Given the initial regressor matrix, X, include an additional regressor z (z is a non-zero (n × 1) vector), so that there is a larger regressor matrix, W, partitioned as W = (X z).
Now, consider the OLS estimator from the regression of y on W,
(d', c)' = (W'W)⁻¹W'y,
with the resulting OLS residual
(3.6.5) u = y − W(d', c)' = y − Xd − zc
and the formula for c in closed form, obtained as a specific application of the general analysis of the previous section,
(3.6.6) c = (z'M[X]z)⁻¹z'M[X]y = z'M[X]y / (z'M[X]z),
where M[X] = I − X(X'X)⁻¹X'.
Exercise 13. Derive the formula for d in closed form solution.
We wish to prove that the residual sum of squares always decreases when one regressor is added to X, i.e., given e defined as in (3.3.3), u'u < e'e. To this purpose, it is convenient to express d as in equation (3.6.3):
d = (X'X)⁻¹X'(y − zc).
Replacing the foregoing equation into (3.6.5) yields
u = y − X(X'X)⁻¹X'(y − zc) − zc = y − X(X'X)⁻¹X'y + X(X'X)⁻¹X'zc − zc = M[X]y − M[X]zc.
Since M[X]y = e (from (3.5.3); remember that M[X] is the "residual maker"),
u = e − M[X]zc,
so that the residual sum of squares of the enlarged regression is
(3.6.7) u'u = e'e − e'M[X]zc − cz'M[X]e + c²z'M[X]z = e'e − 2cz'M[X]e + c²z'M[X]z.
Then, replacing e with M[X]y in the second term of equation (3.6.7) and using the fact that M[X] is idempotent gives
u'u = e'e − 2cz'M[X]y + c²z'M[X]z.
Finally, replace c from (3.6.6) into the foregoing equation to obtain
(3.6.8) u'u = e'e − 2(z'M[X]y)²/(z'M[X]z) + (z'M[X]y)²/(z'M[X]z) = e'e − (z'M[X]y)²/(z'M[X]z)
and since (z'M[X]y)²/(z'M[X]z) > 0,
(3.6.9) u'u < e'e.
Exercise 14. How would the formulas for c, d and u'u change if the new regressor z is orthogonal to X?
3.6.2. The squared coefficient of partial correlation (not covered in class, but a good and easy exercise). The squared coefficient of partial correlation between y and z, r*²_yz,
r*²_yz = (z'M[X]y)² / (z'M[X]z · y'M[X]y),
measures the extent to which y and z are related net of the variation in X. In this sense it is closely related to c, and indeed
c = r*²_yz · y'M[X]y / (z'M[X]y).
Moreover, by (3.5.3), y'M[X]y = e'e, and hence, given equation (3.6.8), we have
u'u = e'e (1 − r*²_yz).
3.7. Goodness of fit and the analysis of variance
Assume that the unity vector, 1, is part of the regressor matrix X. Total variation in y can be expressed by the following sum of squares, referred to as the sum of squared total deviations:
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
or, given what was established in exercise 11,
SST = y'M[1]y.
Notice that SST is the sample variance of y, SST/(n − 1), times the appropriate degrees-of-freedom correction, n − 1. Incidentally, the degrees-of-freedom correction in the sample variance is n − 1 and not n because M[1]y are the residuals from the regression of y on 1 (see exercise 11), and so there can be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then, given equation (3.4.1), dim(1⊥) = n − 1.
By the orthogonal decomposition of y in (3.5.4),
M[1]y = M[1]ŷ + M[1]e.
But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, so that
(3.7.1) M[1]e = e
and
M[1]y = M[1]ŷ + e.
Then,
SST = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.
But since equation (3.7.1) holds, ŷ = Xb, and e and X are orthogonal, then e'M[1]ŷ = e'ŷ = e'Xb = 0, and so SST simplifies to
SST = ŷ'M[1]ŷ + e'e.
Throughout, I stick to Greene's acronyms and refer to ŷ'M[1]ŷ as SSR (regression sum of squares) and to e'e as SSE (error sum of squares). Notice, however, that the name "error sum of squares" may be misleading in contexts where the distinction between random errors and estimated residuals is crucial, given that e'e actually represents the part of total variation accounted for by the OLS residuals, not by the errors. For this reason, I continue to refer to e'e as the residual sum of squares and sometimes abbreviate it with Greene's acronym SSE.
As with SST, SSE is the sample variance of the residuals times the appropriate degrees-of-freedom correction, n − k. Again, the degrees-of-freedom correction in the sample variance is n − k and not n because in the residual space, X⊥, there can be no more than n − k linearly independent vectors. This follows from the assumption that X is of full column rank, so that rank(X) = k and, given equation (3.4.1), dim(X⊥) = n − k.
The coefficient of determination, R², is defined as
(3.7.2) R² = SSR/SST = ŷ'M[1]ŷ / (y'M[1]y)
and since ŷ'M[1]ŷ = y'M[1]y − e'e,
R² = 1 − e'e / (y'M[1]y).
Therefore, if the constant term is included in the regression, 0 ≤ R² ≤ 1 and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit.² There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y already lies in R(X), then y = ŷ (and also e'e = 0) and R² = 1, a perfect (but useless) fit.³
² If the constant term is not included in the regression, then (3.7.1) does not hold and R² may be negative.
³ I am maintaining throughout the obvious assumption that in any case y ∉ R(1). Why?
A problem with the R² measure is that it never decreases when a regressor is added to X (this emerges straightforwardly from the R² formula and the inequality in (3.6.9)), so in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case R² = 1 is attained if n = k, since in that case y ends up lying in R(X)). This problem may be obviated by using the corrected (or adjusted) R², R̄², defined by including in the formula for R² the appropriate degrees-of-freedom corrections:
R̄² = 1 − [SSE/(n − k)] / [SST/(n − 1)].
In fact, R̄² does not necessarily increase when one more regressor is added.
Exercise 15. Prove that, given W, u and e defined as in Section 3.6.1, the coefficient of determination resulting from the regression of y on W is
R²_W = R² + (1 − R²) r*²_yz,
where R² is the goodness-of-fit measure from the reduced regression.
Exercise 16. Prove that
R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).
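A short Stata sketch of the two facts just discussed — adding a regressor never raises the residual sum of squares nor lowers R², while R̄² can fall. Variable names are placeholders and the added regressor is pure noise by construction:

* Sketch: R-squared versus adjusted R-squared when an irrelevant regressor is added
quietly regress depvar x1 x2
display "R2 = " e(r2) "   adj R2 = " e(r2_a) "   SSE = " e(rss)
generate double junk = rnormal()
quietly regress depvar x1 x2 junk
display "R2 = " e(r2) "   adj R2 = " e(r2_a) "   SSE = " e(rss)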
3.8. Centered and uncentered goodness-of-fit measures
Consider the OLS regression of y on the sample regressor matrix X and let b denote the OLS vector. The centered and uncentered R-squared measures (see Hayashi (2000), p. 20) for this regression are defined as
(3.8.1) R² ≡ ŷ'M[1]ŷ / (y'M[1]y) = b'X'M[1]Xb / (y'M[1]y) = y'P[X]M[1]P[X]y / (y'M[1]y)
and
(3.8.2) R²_u ≡ ŷ'ŷ / (y'y) = b'X'Xb / (y'y) = y'P[X]y / (y'y),
respectively. It is easy to prove that 0 ≤ R²_u ≤ 1 and
R²_u = 1 − e'e / (y'y)
whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and
(3.8.3) R² = 1 − e'e / (y'M[1]y)
hold if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample means, that is, M[1]y = y and M[1]X = X. Clearly, in the latter case R² = R²_u.
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²_{y,ŷ}, that is,
(3.8.4) R² = (ŷ'M[1]y)² / (ŷ'M[1]ŷ · y'M[1]y),
where R² is defined in (3.8.1) and ŷ = Xb. Given the definition of R² in (3.7.2), proving equation (3.8.4) boils down to proving ŷ'M[1]y = ŷ'M[1]ŷ, which can be accomplished easily along the following lines. Since y = ŷ + e,
ŷ'M[1]y = ŷ'M[1](ŷ + e) = ŷ'M[1]ŷ + ŷ'M[1]e = ŷ'M[1]ŷ + ŷ'e = ŷ'M[1]ŷ,
where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals.
This property is not shared by the uncentered R-squared, unless the variables have zero sample means.
3.8.1. A convenient formula for R² when the constant is included (not covered in class, but a good and easy exercise). Partitioning X as X = (X̃ 1) and using Lemma 12 gives P[X] = P[1] + P[M[1]X̃], which, replaced into (3.8.1), gives in turn
R² = y'(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) y / (y'M[1]y).
But
(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) = P[1]M[1]P[1] + P[M[1]X̃]M[1]P[1] + P[1]M[1]P[M[1]X̃] + P[M[1]X̃]M[1]P[M[1]X̃] = P[M[1]X̃],
so that eventually
(3.8.5) R² = y'P[M[1]X̃]y / (y'M[1]y),
which proves at once that the R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X̃, namely the OLS regression of y in deviations from its sample mean on X̃ in deviations from its sample means.
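A sketch of equation (3.8.5) in Stata (placeholder variable names): the centered R² of the levels regression coincides with the R² that regress reports for the demeaned regression run without a constant, which is the uncentered measure:

* Sketch: centered R2 = uncentered R2 computed on demeaned data
quietly regress depvar x1 x2
scalar r2_centered = e(r2)
foreach v in depvar x1 x2 {
    quietly summarize `v'
    generate double d_`v' = `v' - r(mean)      // deviations from sample means
}
quietly regress d_depvar d_x1 d_x2, noconstant
display "centered R2 (levels regression):   " r2_centered
display "R2 of the demeaned regression:     " e(r2)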
CHAPTER 4
The finite-sample statistical properties of OLS
4.1. Introduction
This chapter is on the finite-sample statistical properties of OLS applied to the LRM.
Finite-sample means that we focus on a fixed sample size n as opposed to n → ∞, a case that
will be covered in Chapter 7. We will learn under what assumptions on the LRM and in which
sense the estimator is optimal. We will also learn how to test linear restrictions on the model
parameters. Finally, we will study an important case of inaccuracy for the OLS, which is the
omitted-variables problem.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using
the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).
4.2. Unbiasedness
Under LRM.1-LRM.3, OLS is unbiased, that is, E(b) = β.
This is proved as follows. From LRM.1, LRM.2 and the OLS formula in (3.3.2),
(4.2.1) b = β + (X'X)⁻¹X'ε.
From LRM.3, then,
E(b|X) = β + (X'X)⁻¹X'E(ε|X) = β.
Finally, using the law of iterated expectations,
E(b) = E_X[E(b|X)] = E_X[β] = β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.
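A small Monte Carlo sketch of unbiasedness; the data generating process, sample size and number of replications are illustrative choices, not taken from the notes:

* Sketch: the OLS slope averages to the true value across repeated samples
capture program drop olssim
program define olssim, rclass
    drop _all
    set obs 50
    generate x = rnormal()
    generate y = 1 + 2*x + rnormal()    // true slope = 2
    regress y x
    return scalar b = _b[x]
end
set seed 202
simulate b = r(b), reps(2000) nodots: olssim
summarize b                             // the mean of b is close to 2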
4.3. The Gauss-Markov Theorem
Let's work out the conditional and unconditional covariance matrices of OLS under LRM.1-LRM.4.
I start with Var(b|X). Since
Var(b|X) = E[(b − β)(b − β)'|X],
then, given equation (4.2.1),
Var(b|X) = E[(X'X)⁻¹X'εε'X(X'X)⁻¹ | X] = (X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹ = σ²(X'X)⁻¹,
where the last equality follows from LRM.3 and LRM.4.
Now I turn to Var(b). Using the law of decomposition of variance,
Var(b) = E_X[Var(b|X)] + Var_X[E(b|X)] = σ²E_X[(X'X)⁻¹],
where the last equality follows from E(b|X) = β.
Next I prove that OLS has the "smallest", in a sense that will be clarified shortly, covariance matrix in the class of linear unbiased estimators.
I define the following partial order on the space of l × l symmetric matrices:
Definition 17. Given any two l × l symmetric matrices A and B, A is said to be "no smaller" than B if and only if A − B is non-negative definite (n.n.d.).
Given the partial order of Definition 17, OLS has the "smallest" conditional covariance matrix in the class of linear unbiased estimators. This important result is universally known as the Gauss-Markov Theorem. To prove it, define the generic member of the foregoing class as
b₀ = Cy,
where C is a generic k × n matrix that depends only on the sample information in X and is such that CX = I_k (how would you explain the last requirement?). OLS is of course a member of the class, with its own C equal to C_OLS = (X'X)⁻¹X'. It is not hard to prove that Var(b₀|X) = σ²CC'. Define now D = C − C_OLS; then DX = 0 and so
Var(b₀|X) = σ²[D + (X'X)⁻¹X'][D' + X(X'X)⁻¹] = σ²(X'X)⁻¹ + σ²DD' = Var(b|X) + σ²DD'.
Since σ²DD' is n.n.d., according to Definition 17, Var(b|X) is "no greater" than the conditional variance of any linear unbiased estimator.
The natural question arises of whether the partial order of Definition 17 is of any relevance in real-world applications. It is, since it readily translates into the total order of the real numbers, which is the domain of the variances of random scalars. Indeed, if A is "no smaller" than B, then r'(A − B)r ≥ 0 for any conformable r. But then, according to the Gauss-Markov Theorem, any linear combination of the components of b, r'b, has a conditional variance no larger than that of r'b₀. Formally, the theorem implies that r'[Var(b₀|X) − Var(b|X)]r ≥ 0. Then, since Var(r'b|X) = r'Var(b|X)r and Var(r'b₀|X) = r'Var(b₀|X)r, it follows that Var(r'b₀|X) ≥ Var(r'b|X). The importance of this hinges upon the fact that in empirical applications we are interested in linear combinations of the population coefficients, as in the following example, where it is shown that the estimates of the individual coefficients can always be expressed as specific linear combinations of the k components of the estimators.
Example 18. On noticing that bᵢ = rᵢ'b and b₀ᵢ = rᵢ'b₀, i = 1, ..., k, where rᵢ is the k × 1 vector with all zero elements except the i-th entry, which equals unity, and given the Gauss-Markov Theorem, we conclude that Var(b₀ᵢ|X) ≥ Var(bᵢ|X), i = 1, ..., k.
In general, the OLS estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
Now we prove that the Gauss-Markov Theorem extends to the unconditional variances. From
Var(b₀|X) = Var(b|X) + σ²DD',
we have
E_X[Var(b₀|X)] = E_X[Var(b|X)] + σ²E_X(DD'),
or
Var(b₀) = Var(b) + σ²E_X(DD'),
and since E_X(DD') is n.n.d., we can also state that the unconditional variance of OLS is "no greater" than that of any linear unbiased estimator.
4.4. Estimating the covariance matrix of OLS
Since σ² is unknown, so are Var(b|X) and Var(b). Unbiased estimators of Var(b|X) and Var(b), therefore, require an unbiased estimator of σ². We now prove that, under LRM.1-LRM.4, the residual sum of squares divided by the appropriate degrees-of-freedom correction, s² = e'e/(n − k), is one such estimator, namely E(s²) = σ².
From e = M[X]y and LRM.1 it follows that e = M[X]ε. Hence,
(4.4.1) E(s²|X) = [1/(n − k)] E(ε'M[X]ε | X).
Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule for the trace of a matrix product, ε'M[X]ε = tr(ε'M[X]ε) = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields
E(s²|X) = [1/(n − k)] E[tr(M[X]εε') | X].
Then, exploiting the fact that both trace and expectation are linear operators,
(4.4.2) E(s²|X) = [1/(n − k)] tr E(M[X]εε'|X) = [1/(n − k)] tr[M[X]E(εε'|X)] = [σ²/(n − k)] tr M[X],
where the last equality follows from LRM.3 and LRM.4. Now, focus on tr M[X]:
tr M[X] = tr[I_n − X(X'X)⁻¹X'] = tr I_n − tr[(X'X)⁻¹X'X] = n − k,
and so (4.4.2) simplifies to E(s²|X) = σ². Finally, by the law of iterated expectations,
E(s²) = σ².
With s² at hand we can obtain an unbiased estimator for Var(b): the estimator s²(X'X)⁻¹. In fact,
E[s²(X'X)⁻¹ | X] = σ²(X'X)⁻¹ = Var(b|X)
and, since E[s²(X'X)⁻¹] = E_X{E[s²(X'X)⁻¹ | X]} by the law of iterated expectations,
E[s²(X'X)⁻¹] = σ²E_X[(X'X)⁻¹] = Var(b).
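A quick Stata sketch — placeholder variable names, not the course data — showing that s² and s²(X'X)⁻¹ are exactly what regress reports:

* Sketch: recovering s^2 and the estimated covariance matrix of b
quietly regress depvar x1 x2
display e(rss)/e(df_r)           // s^2 = e'e/(n - k)
display e(rmse)^2                // the same number: Root MSE squared
matrix V = e(V)                  // s^2 (X'X)^(-1), as computed by regress
matrix list V
display sqrt(V[1,1])             // standard error of the coefficient on x1
display _se[x1]                  // matches the value displayed by regress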
4.5. Exact tests of significance with normally distributed errors
Assume LRM.5: ε|X ~ N(0, σ²I_n). Since equation (4.2.1) holds,
b|X ~ N(β, σ²(X'X)⁻¹).
Also, since e = M[X]ε, e|X ~ N(0, σ²M[X]). Using a result in Rao (1973), it is possible to prove at once that b and e are also jointly normal with zero covariances, conditional on X. Specifically, since

    ( b )   ( β )       ( (X'X)⁻¹X' )  ε
    (   ) = (   ) + σ · (           ) ---,
    ( e )   ( 0 )       (   M[X]    )  σ

then by (8a.2.9) in Rao (1973)

    ( b )          [ ( β )     ( (X'X)⁻¹X' )                     ]
    (   ) | X  ~  N[ (   ), σ² (           ) ( X(X'X)⁻¹  M[X] )  ]
    ( e )          [ ( 0 )     (   M[X]    )                     ]

or

    ( b )          [ ( β )   ( σ²(X'X)⁻¹      0      ) ]
    (   ) | X  ~  N[ (   ),  (                       ) ]
    ( e )          [ ( 0 )   (     0       σ²M[X]    ) ]

Therefore, being normally distributed with zero conditional covariances, conditional on X, b and e are also independent, conditional on X. This general result is important and is therefore stated as a theorem for future reference.
Theorem 19. Assume LRM.5. Then b and e are independent, conditional on X.
Exercise 20. Verify, by direct computation of Cov(b, e|X), that Cov(b, e|X) = 0_{k×n}.
Solution: Since E(e|X) = 0 (verify), then¹
Cov(b, e|X) = E[(b − β)e' | X]
or
Cov(b, e|X) = E[(X'X)⁻¹X'εε'M[X] | X] = (X'X)⁻¹X'E(εε'|X)M[X] = σ²(X'X)⁻¹X'M[X] = 0_{k×n}.
Exercise 21. Verify, by direct computation of Var(e|X), that Var(e|X) = σ²M[X].
Exercise 22. Is Var[(b', e')' | X] non-singular? Why or why not?
¹ In general, the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]' | z}.
4.5.1. Testing single linear restrictions. Let (X 0X)
�1ii stand for the i.th main diagonal
element of (X 0X)
�1, then
bi|X ⇠ N⇣
�i,�2�
X 0X��1ii
⌘
and, given the properties of the normal distribution, bi can be standardized to have
(4.5.1)bi � �i
q
�2(X 0X)
�1ii
|X ⇠ N (0, 1) ,
i = 1, ..., k. Were �2 known, then the above statistics could be used to test hypotheses on �i,
Ho : �i = �⇤i , by replacing the unknown �i with �⇤
i , where �⇤i is a value of interest fixed by
the researcher. For example, to test Ho : �i = 0 one would use
biq
�2(X 0X)
�1ii
⇠ N (0, 1) .
The problem is, of course, that σ² is generally unknown and so the foregoing approach cannot be used as it is. With some adjustment we can make it operational, though. Just replace σ² with s² in the expression for the standardized b_i to get

(4.5.2)   t_i = (b_i − β*_i) / √( s²(X'X)^{-1}_{ii} )

and then prove that t_i has a t distribution with n − k degrees of freedom when β_i = β*_i. The denominator of expression (4.5.2) is the standard error estimate for coefficient b_i.
First, notice that since s² = e'e/(n − k) = ε'M[X]ε/(n − k),

(4.5.3)   (n − k)s²/σ² = (ε/σ)' M[X] (ε/σ).

Now, consider the following distributional result:
• if z ~ N(0, I) and A is a conformable idempotent matrix, then z'Az ~ χ²(p) with p = rank(A).
Since ε/σ ~ N(0, I) and M[X] is idempotent, (n − k)s²/σ² is an idempotent quadratic form in a standard normal vector and, in the light of the foregoing distributional result, has a chi-squared distribution with degrees of freedom equal to rank(M[X]) = n − k:

(n − k)s²/σ² ~ χ²(n − k).

Since Theorem 19 holds, any function of b is independent of any function of e, conditionally on X; hence (b_i − β*_i)/√(σ²(X'X)^{-1}_{ii}) and (n − k)s²/σ² are conditionally independent. Further,

t_i = [ (b_i − β*_i)/√(σ²(X'X)^{-1}_{ii}) ] / √[ ((n − k)s²/σ²) / (n − k) ],

therefore, in the light of the following second distributional result
• if z ~ N(0, 1), x ~ χ²(p) and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom,
we have t_i|X ~ t(n − k), i = 1, ..., k.
Finally, since the t distribution does not depend on the sample information and, specifically, on X, t_i and any component of X are statistically independent, so that the above holds also unconditionally, that is t_i ~ t(n − k), i = 1, ..., k.
Often we wish to test hypotheses involving linear combinations of β, r'β, where r is a k×1 vector of known constants.
Example 23. If we are estimating a two-input Cobb-Douglas production function and β1 and β2 are the output elasticities of the two inputs, the hypothesis of constant returns to scale is clearly important, so that our null is β1 + β2 = 1.
In general we express a null involving a linear combination of the population coefficients as H0: r'β − q = 0, where q is a known constant. In the Cobb-Douglas example r = (1 1)' and q = 1.
The OLS estimator of r'β is r'b, which is normally distributed conditional on X: r'b|X ~ N[ r'β, σ²r'(X'X)^{-1}r ]. Therefore we have the following t test:

(r'b − q) / √( s² r'(X'X)^{-1}r ) ~ t(n − k),

which can be used to test H0: r'β − q = 0.
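As an illustration, here is a minimal Stata sketch of the constant-returns-to-scale test in a log Cobb-Douglas regression; the variable names (lny, lnl, lnk) are hypothetical and are not part of the notes.

. regress lny lnl lnk
. lincom lnl + lnk - 1          // t test of H0: r'beta - q = 0 with r = (1 1)' and q = 1
. test lnl + lnk = 1            // the same single restriction as an F test

lincom reports the estimated linear combination with its standard error, t statistic and confidence interval; for one restriction the F statistic reported by test is the square of that t statistic.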
4.5.1.1. Stata implementation. For the sake of exposition, I report here the regress output already seen in Section 3.3.3.

. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"

. regress c y

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193
------------------------------------------------------------------------------
           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669     .031607   30.98   0.000     .9063809    1.052153
       _cons |  -67.58063    27.91068   -2.42   0.042    -131.9428   -3.218488
------------------------------------------------------------------------------
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). The second column reports the standard error estimate of each OLS coefficient, se_i = √( s²(X'X)^{-1}_{ii} ), i = 1, ..., k. The third column reports the values of the t statistics for the k null hypotheses β_i = 0, i = 1, ..., k:

t_i = b_i / √( s²(X'X)^{-1}_{ii} ).

The test is two-sided in that the alternative is H1: β_i > 0 or β_i < 0. The fourth column reports the so-called p-value of the two-sided t-test. It is defined as the probability that a t-distributed random variable is more extreme than the outcome of t_i in absolute value: Pr[ (t < −|t_i|) ∪ (t > |t_i|) ], or more compactly Pr(|t| > |t_i|). Clearly, if the p-value is smaller
than the chosen size of the test (the Type I error) then ti falls for sure into the critical region
and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of
the critical region (the lowest Type I error) we could have fixed to reject the null, given the test
outcome. In this sense, the p-value is more informative than critical values. In the regress
example, if we choose a critical region of 5% size, given that Pr (|t| > 2.42) = 0.042 < 0.05,
we can reject at 5% that the constant term is equal to zero, knowing that we could also have
rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is
the minimum size allowing rejection, and for this reason we can’t reject at 1%. This is a clear
case of borderline significance, one which we could not have identified with such precision by
simply looking at the 5% critical values. On the other hand, the p-value for the coefficient
on y is virtually zero (reported as 0.000). This indicates that, no matter how conservative we are towards the null, we can reject it at any conventional level of significance
(conventional sizes, with an increasing degree of conservativeness are 10%, 5%, 1%) and also
at a less conventional 0.1% (since 0.001 > 0.000).
4.5.2. From tests to confidence intervals. Let us fix the α·100% critical region for our two-sided t test of the null H0: β_i = β*_i against the alternative H1: β_i ≠ β*_i and let ±t_{α/2} be the corresponding critical values: Pr[ (t < −t_{α/2}) ∪ (t > t_{α/2}) ] = α. Then, the probability of not rejecting the null when it is true is (1 − α). Formally,

Pr{ | (b_i − β*_i)/se_i | < t_{α/2} } = Pr{ −t_{α/2} < (b_i − β*_i)/se_i < t_{α/2} }
                                      = Pr{ −se_i t_{α/2} < b_i − β*_i < se_i t_{α/2} }
                                      = Pr{ b_i − se_i t_{α/2} < β*_i < b_i + se_i t_{α/2} }
                                      = (1 − α).
But (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) is a (1 − α)·100% confidence interval for β_i. This proves that the (1 − α)·100% confidence interval (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) contains all of the null hypotheses β_i = β*_i that we cannot reject at α·100%. So while a given t test is informative only for the specific null it is testing, the confidence interval conveys to the researcher much more information. The last column of the regress output reports the 95% confidence interval for each OLS coefficient.
Exercise 24. Your regression output for a given coefficient β_i yields b_i = −9.320 and se_i = 1.760. 1) Compute the t-statistic for the null H0: β_i = 0. 2) In your regression n − k = 334, which implies that t_{0.025} = 1.967, where t_{0.025}: Pr(t > t_{0.025}) = 0.025. Will you reject or not H0: β_i = 0 against H1: β_i ≠ 0 at a significance level of 5%? Why? 3) Given your answer to Question 2, do you expect that 0 belongs to the 95% confidence interval for β_i? 4) Compute the 95% confidence interval for β_i. On the basis of the information from the confidence interval alone, do you reject H0: β_i = −6 against H1: β_i ≠ −6 at 5%? Why? 5) Using only your answers to Question 4, can you assert that the p-value of that test is greater than 0.05? Also, do you expect the absolute value of the t statistic for H0: β_i = −6 to be greater or smaller than 1.967? Why? Verify your answer by directly computing the value of the t statistic for H0: β_i = −6. 6) Consider now the test of H0: β_i ≤ 0 against H1: β_i > 0 with a 5% significance level. Is the critical value for this test equal to, smaller than or greater than 1.967?
Answer: 1) t_i = −5.295. 2) Reject, because 5.295 > 1.967. 3) No, since H0: β_i = 0 is rejected at 5%. 4) (−12.782, −5.858). No, since −6 ∈ (−12.782, −5.858). 5) Yes: since H0: β_i = −6 is not rejected at 5%, t_i falls within the acceptance region and so the test p-value > 0.05. Since t_i falls within the acceptance region, the value of |t_i| must be smaller than t_{0.025} = 1.967. Indeed, t_i = −1.886. 6) Smaller: since the test is one-sided, the critical value is t_{0.05}.
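The arithmetic in the answer can be reproduced with Stata's display calculator; the numbers are those of the exercise, and the commands are only an illustrative check.

. display (-9.320 - 0)/1.760          // t statistic for H0: beta_i = 0
. display -9.320 - 1.967*1.760        // lower bound of the 95% confidence interval
. display -9.320 + 1.967*1.760        // upper bound of the 95% confidence interval
. display (-9.320 - (-6))/1.760       // t statistic for H0: beta_i = -6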
Exercise 25. Your regression output for a given coefficient β_i yields b_i = 6.668 with se_i = 3.577. The outcome of the t-test of H0: β_i = 0 against H1: β_i ≠ 0 shows p-value = 0.07. Can you reject the null at 10%? Can you at 5%?
4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions: H0: Rβ − q = 0, where R and q are, respectively, a J×k matrix and a J×1 vector of fixed known constants, such that no row of R can be obtained as a linear combination of the others, that is, R has full row rank J.
Under the null,

Rb − q = R(b − β)

and so, given LRM.5,

Rb − q | X ~ N( 0, σ² R(X'X)^{-1}R' ).

Then, using the distributional result that
• given the p×1 random vector x ~ N(μ, Σ), (x − μ)'Σ^{-1}(x − μ) ~ χ²(p),
we have

W = (Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / σ²  | X  ~  χ²(J).
Again, σ² is not known and so W is unfeasible as a test for H0. We can proceed as in the previous section and replace σ² with s². In addition, divide the result by J to get the statistic

F = (Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / (J s²).

Now consider another distributional result:
• given two independent random scalars x1 ~ χ²(p1) and x2 ~ χ²(p2), (x1/p1)/(x2/p2) ~ F(p1, p2).
It is not hard to see that the above result applies to F, since F can be reformulated as the ratio of two conditionally independent chi-squared random variables, each divided by its own degrees of freedom. In fact, in the numerator we have

(Rb − q)'[ R(X'X)^{-1}R' ]^{-1}(Rb − q) / (J σ²)

and in the denominator s²/σ². Conditional on X, the former is a function of b alone, while the latter is a function of e alone. Therefore, in the light of Theorem 19 the two are conditionally independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).
As with the t statistic, since the F distribution does not depend on the sample information, the above holds unconditionally: F ~ F(J, n − k).
When H0 is a set of J exclusion restrictions, q = 0 and each row of R has all zero elements except a unity in the entry corresponding to the parameter that is set to zero. For example, with three parameters β' = (β1 β2 β3) and two exclusion restrictions, β1 = 0 and β3 = 0, we have J = 2, q' = (0 0) and

R = ( 1 0 0 ; 0 0 1 )   (rows separated by a semicolon),

so that H0 can be formulated as

( 1 0 0 ; 0 0 1 ) (β1, β2, β3)' = (0, 0)'.
The F-test can always be rewritten as a function of the residual sum of squares from the unrestricted model, e'e, and the residual sum of squares from the model with the restrictions imposed, say e*'e*:

F = [ (e*'e* − e'e)/J ] / [ e'e/(n − k) ].

This is proved for the case of exclusion restrictions by using Lemma 12.
Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set of exclusion restrictions H0: β2 = 0:

F = b2'[ (X2'M[X1]X2)^{-1} ]^{-1} b2 / (k2 s²) = b2'X2'M[X1]X2 b2 / (k2 s²).

Now apply the FWL Theorem to F to get

F = y'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]y / (k2 s²)
  = y'M[X1]X2 (X2'M[X1]X2)^{-1} X2'M[X1]y / (k2 s²).

The numerator of the right hand side of the foregoing equation can be written more compactly as y'P[M[X1]X2]y. Hence, by Lemma 12,

F = y'( P[X] − P[X1] )y / (k2 s²)

or, adding and subtracting I_n within the parentheses,

F = y'( M[X1] − M[X] )y / (k2 s²) = [ (e*'e* − e'e)/k2 ] / s².

It is not hard to prove that if the constant term is kept in both models, then

F = [ (R² − R*²)/J ] / [ (1 − R²)/(n − k) ],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted model.
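A hedged Stata sketch of a joint exclusion test and of the R-squared formula above; the variable names (y, x1, x2, x3) are hypothetical and the restricted model drops x1 and x3 (so J = 2 and, counting the constant, k = 4).

. regress y x1 x2 x3
. test x1 x3                      // F test of H0: beta_1 = beta_3 = 0
. scalar r2_u = e(r2)
. scalar nobs = e(N)
. regress y x2                    // restricted model
. scalar r2_r = e(r2)
. display ((r2_u - r2_r)/2) / ((1 - r2_u)/(nobs - 4))   // reproduces the F statistic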
4.6. The general law of iterated expectation
The general form of the law of iterated expectations (LIE) can be stated as in Wooldridge
(2010), pp. 19-20.
LIE(scalar|vector): Given the random variable y and the random vectors w and x,
where x = f (w), then E (y|x) = E [E (y|w) |x].
Since the above result holds for any function f (·), x can just be any subvector of w, as the
following example shows.
Example 26. Consider w = (w1 w2 w3)' and x = Aw, where

A = ( 1 0 0 ; 0 1 0 ),

then x = (w1 w2)' and so, by the general law of iterated expectations,

E(y|w1, w2) = E[ E(y|w1, w2, w3) | w1, w2 ].
The law can, of course, be formulated in terms of conditional expectations of random vectors.
LIE(vector|vector): Given the random vector y and the random vectors w and x, where x = f(w), then E(y|x) = E[ E(y|w) | x ], where

E(y|x) = ( E(y1|x), ..., E(yn|x) )'   and   E(y|w) = ( E(y1|w), ..., E(yn|w) )'.
Remark 27. Notice that in the formulation of conditional expectations the way the conditioning set is represented is just a matter of notational convenience. What matters are the random scalars that enter the conditioning set, not the way they are organized therein. For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w), where w = (w1 w2 w3 w4)', or as E(y|W), where

W = ( w1 w3 ; w2 w4 ),

or through any other organization of (w1, w2, w3, w4).
Given Remark 27 the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.
LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f (W ), then E (y|X) = E [E (y|W ) |X].
Paralleling the consideration made above, since f (·) is a generic function, from LIE(vector|matrix)
follows a special LIE for the case in which X is a submatrix of W . Therefore, given W =
(W1 W2), by LIE(vector|matrix) we always have
E (y|W1) = E [E (y|W ) |W1]
and
E (y|W2) = E [E (y|W ) |W2] .
4.7. The omitted variable bias
If explanatory variables that are relevant in the population model, for some reasons, are not
included into the statistical model - they may be intrinsically latent, such as individual skills,
or the specific data-set in use do not report them, or also, although observed and available,
the researcher failed to account for them in the model specification - then our OLS estimator
may undergone what is known in the econometric literature as an omitted variable bias. Let’s
see when and why.
Assume that the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied, and consider the RS mechanism
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(y_i x_{i1} x_{i2} ... x_{ik}), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.
So far we are in the classical regression framework, but now let x' = (x1' x2'), with x1 a k1×1 vector, x2 a k2×1 vector and k = k1 + k2, maintain that x2 is latent or, in any case, not included in the statistical model, and let's explore the implications for the statistical model. P.1 implies that

(4.7.1)   y = X1β1 + ζ,

where ζ = X2β2 + ε is now the new n×1 vector of latent realizations. If X is of f.c.r., then X1 is of f.c.r. as well. So, in a sense, LRM.1 and LRM.2 continue to hold. But, as far as LRM.3 and LRM.4 are concerned, nothing can be said. Specifically, we do not know whether E(ζ|X1) = 0 or Var(ζ|X1) = ς²I_n. The first consequence is that the OLS estimator of β1,

b1 = (X1'X1)^{-1}X1'y,

is likely to be biased. Indeed, the bias can easily be derived as follows. Replacing the right hand side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)^{-1}X1'ζ = β1 + (X1'X1)^{-1}X1'X2β2 + (X1'X1)^{-1}X1'ε.
Since RS holds, ε_i, i = 1, ..., n, is conditional-mean independent of (x_{i1}' x_{i2}') and statistically independent of (x_{j1}' x_{j2}'), j ≠ i = 1, ..., n. Therefore, E(ε|X) = 0, which implies that

E(b1|X) = β1 + (X1'X1)^{-1}X1'X2β2

and hence, by the law of iterated expectations, we have the unconditional bias

(4.7.2)   E(b1) − β1 = E[ (X1'X1)^{-1}X1'X2β2 ].
There are two specific instances, however, in which the bias is zero.
The first instance is that analyzed in Greene (2008), when X1'X2 = 0_{k1×k2}. In this case (X1'X1)^{-1}X1'X2β2 = 0 and so the bias in equation (4.7.2) is zero.
The second instance occurs if in the population E(x2'β2|x1) = 0, as I now show. Since in the population E(ε|x) = 0, by the general law of iterated expectations also E(ε|x1) = 0. Hence, E(x2'β2 + ε|x1) = 0, which along with RS yields E(ζ|X1) = 0. Therefore, the ζ vector in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is that b1 is unbiased.
The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 = 0_{k1×k2}. But the converse is not true either, and X1'X2 = 0_{k1×k2} may hold even if E(X2β2|X1) ≠ 0, as shown by the following example.
Example 28. Let y = x1β1 + x2β2 + ε with x1 a Bernoulli random variable:

Pr(x1 = 1) = ρ and Pr(x1 = 0) = 1 − ρ,

and let x2 = 1 − x1. While E(x2β2|x1) = (1 − x1)β2, we have x1x2 = 0 with probability one. In this case, a random sample of y, x1 and x2 from the population will yield X1'X2 = 0 and E(X2β2|X1) ≠ 0 with probability one.
Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case, and in general omitted variables will bring about bias and inconsistency in the coefficient estimates. Solutions are typically given by proxy variables, panel data estimators and instrumental variables estimators. The first method is briefly described below, the classical panel data estimators are pursued in Chapter 8, while IV methods are described in Chapter 10.
To conclude, observe that if relevant variables are omitted, LRM.4 does not generally hold either, unless Var(x2'β2|x1) = ς² < +∞.
Lemma 29. Given any two non-singular square matrices of the same dimension, A and B, if A − B is n.n.d. then B^{-1} − A^{-1} is n.n.d.
The foregoing lemma signifies that, in the space of non-singular square matrices of a given dimension, if A is "no smaller" than B, then A^{-1} is "no greater" than B^{-1}. It is useful in situations in which the difference of inverse matrices is more easily worked out than that of the original matrices.
The following exercise asks you to think through the consequences of overfitting, namely
applying OLS to a statistical model with variables that are redundant in the population model.
Exercise 30. Assume the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied. Assume also that the l×1 vector z of observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also, assume E(ε|x z) = 0 and Var(ε|x z) = σ², i.e. z is redundant in the population equation. Finally, assume there is a sample of size n from the population, such that the elements of the sequence {(y_i x_i' z_i'), i = 1, ..., n} are i.i.d. 1×(1 + k + l) random vectors. Applying the usual notation for the sample variables,
y (n×1) = (y1, ..., yi, ..., yn)',   X (n×k) with i-th row x_i',   Z (n×l) with i-th row z_i',   ε (n×1) = (ε1, ..., εi, ..., εn)',
answer the following questions. 1) Prove that the statistical model

y = Xβ + ε

satisfies LRM.1-LRM.4 (of course, this proves that

(4.7.3)   b = (X'X)^{-1}X'y

is BLUE). 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased estimator of β, and call it b_ofit. 3) Derive the covariance matrix of b_ofit. 4) Use Lemma 29 and verify that, indeed, the conditional covariance matrix of b_ofit is "no smaller" than that of b in (4.7.3). 5) A byproduct of the overfitting strategy is the l×1 vector of OLS coefficients on the variables in Z. Let's call it c. Express c using the intermediate result in the FWL theorem as

c = (Z'Z)^{-1}Z'(y − X b_ofit)

and prove that the overfitting residual vector e_ofit ≡ y − X b_ofit − Zc equals

e_ofit = M[M[Z]X] M[Z] y.

6) Find an unbiased estimator of σ² based on e_ofit.
Answer: 1) Obvious, since in the population and the sampling mechanism we have all we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by noting that from RS and E(ε|x z) = 0 it follows that E(ε|X Z) = 0. 3) Prove that Var(ε|X Z) = σ²I and then prove that

Var[ (X'M[Z]X)^{-1}X'M[Z]y | X Z ] = σ²(X'M[Z]X)^{-1}.

4) Write X'M[Z]X as X'M[Z]X = X'X − X'P[Z]X and then verify that you have all that is needed to invoke the lemma. 5) Easy, it's just algebra: replace b_ofit and c into e_ofit ≡ y − X b_ofit − Zc and rearrange. 6) First, use the formula for the overfitting residual vector derived in the previous question, M[M[Z]X]M[Z]y, to set up the estimator

s² = y'M[Z]M[M[Z]X]M[Z]y / (n − k − l).

Then, notice that

s² = ε'M[Z]M[M[Z]X]M[Z]ε / (n − k − l).

Finally, take the expectation of s², conditional on X and Z, using the trace device and following the same steps as when proving unbiasedness of the standard estimator s². In the derivation don't forget that M[Z] and M[M[Z]X] are orthogonal projections.
4.7.1. The proxy variables solution. Assume for simplicity that there is only one omitted variable, x2, in the population equation

(4.7.4)   y = x1'β1 + x2β2 + ε,

where x1 is a k1×1 vector of observed regressors. Assume also that there is an l×1 vector of observed variables z such that the following assumptions hold.
(1) The z variables are redundant in the population equation, that is, E(y|x z) = x'β.
(2) Once conditioning on z, the omitted variable x2 and the included explanatory variables x1 are independent in conditional mean: E(x2|x1 z) = E(x2|z). Also E(x2|z) = z'θ.
(3) rank{ E[ (x1', z')'(x1' z') ] } = k1 + l. This is analogous to property P.2 in Chapter 1 and permits identification of the coefficients in the proxy variable regression, as we will see below.
Assumption 2 implies that x2 can be written as

(4.7.5)   x2 = z'θ + η,

where η = x2 − E(x2|x1 z), and hence E(η|x1 z) = 0. Replacing the right hand side of equation (4.7.5) into the population equation (4.7.4) yields

(4.7.6)   y = x1'β1 + z'(β2θ) + β2η + ε,

where E(β2η|x1 z) = 0. Also, by the redundancy assumption E(y|x z) = x'β, it follows that E(ε|x z) = 0 and so, by the general LIE,

E(ε|x1 z) = E[ E(ε|x z) | x1 z ] = 0.

It follows that E(β2η + ε|x1 z) = 0 and so, along with P.1 and P.2 (given Assumption 3), also P.3 is satisfied for equation (4.7.6). With the following RS mechanism
RS(x1 z): There is a sample of size n from the population, such that the elements of the sequence {(y_i x_{i1} ... x_{ik1} z_{i1} ... z_{il}), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors,
the resulting statistical model will satisfy LRM.1-LRM.3 and so yield unbiased OLS estimates.
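A hedged Stata sketch of the proxy-variable strategy in the classic wage-equation setting; all variable names (lwage, educ, exper, iq as the proxy for unobserved ability) are hypothetical illustrations, not part of the notes.

. regress lwage educ exper          // ability omitted: the educ coefficient suffers the omitted variable bias
. regress lwage educ exper iq       // iq added as a proxy; unbiased for beta_1 under Assumptions 1-3 above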
4.8. The variance of an OLS individual coefficient
Suppose that attention is centered on a given explanatory variable whose observations are collected into the (n×1) column vector x_i, and that there are k − 1 control variables collected into the n×(k − 1) matrix X_{-i}. Without loss of generality, partition the (n×k) regressor matrix as X = (X_{-i} x_i) and, correspondingly, the (k×1) OLS vector as

b = (b_{-i}', b_i)',
where b_{-i} is (k − 1)×1 and b_i is a scalar. Maintain LRM.1-LRM.4, so that

Var(b_i|X) = σ²(X'X)^{-1}_{ii},

where (X'X)^{-1}_{ii} indicates the last entry on the main diagonal of (X'X)^{-1}.
My purpose here is to derive the formula for (X'X)^{-1}_{ii}. As usual when it comes to the derivation of formulas regarding parts of the OLS vector, I invoke the Frisch-Waugh-Lovell (FWL) Theorem. Hence,

b_i = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]y,

so, given

y = X_{-i}β_{-i} + x_iβ_i + ε,

we have

b_i = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]( X_{-i}β_{-i} + x_iβ_i + ε )

and consequently

b_i = β_i + (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]ε.
Finally,

Var(b_i|X) = E[ (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}]εε'M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)^{-1} | X ]
           = (x_i'M[X_{-i}]x_i)^{-1} x_i'M[X_{-i}] E(εε'|X) M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)^{-1}
           = σ²(x_i'M[X_{-i}]x_i)^{-1}
           = σ² / (x_i'M[X_{-i}]x_i),                                   (4.8.1)

which also proves that

(4.8.2)   (X'X)^{-1}_{ii} = 1 / (x_i'M[X_{-i}]x_i).
Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above all, I could get away without referring to the hard-to-remember result on the inverse of a (2×2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in Greene (2008), p. 30).
Exercise 31. Prove (4.8.2) using formula (A-74) for the inverse of a (2×2) partitioned matrix in Greene (2008), p. 966.
4.8.1. The three determinants of Var(b_i|X) when 1 is a regressor. Now I go back to Var(b_i|X) in equation (4.8.1),

Var(b_i|X) = σ²(x_i'M[X_{-i}]x_i)^{-1},

and assume that X_{-i} contains the n×1 unity vector 1, say X_{-i} = (X̃_{-i} 1). Notice, now, that M[X_{-i}]x_i is the residual vector from the OLS regression of x_i on X_{-i}, and so x_i'M[X_{-i}]x_i is the residual sum of squares from this regression. Since the unity vector is a column of X_{-i}, the coefficient of determination of this regression is

R²_i = 1 − x_i'M[X_{-i}]x_i / (x_i'M[1]x_i),

from which we have that

x_i'M[X_{-i}]x_i = (1 − R²_i) x_i'M[1]x_i

and eventually²

Var(b_i|X) = σ² / [ (1 − R²_i) x_i'M[1]x_i ].
Also,

x_i'M[1]x_i = Σ_{j=1}^{n} (x_{ji} − x̄_i)²,

that is, x_i'M[1]x_i is the total variation in x_i around its sample mean x̄_i. Therefore,

(4.8.3)   Var(b_i|X) = σ² / [ (1 − R²_i) Σ_{j=1}^{n} (x_{ji} − x̄_i)² ],

with Var(b_i|X) increasing when
• other things constant, R²_i increases, that is, the correlation between x_i and the other regressors increases (this is the multicollinearity effect on the variance of the OLS individual coefficient);
• other things constant, the total variation in x_i, Σ_{j=1}^{n} (x_{ji} − x̄_i)², decreases;
• other things constant, the regression variance σ² increases.
²An alternative proof is the following. Given Lemma 12, M[X_{-i}] = I − P[1] − P[M[1]X̃_{-i}] and so

Var(b_i|X) = σ² [ x_i'( M[1] − P[M[1]X̃_{-i}] )x_i ]^{-1}

or

Var(b_i|X) = σ² ( x_i'M[1]x_i − x_i'P[M[1]X̃_{-i}]x_i )^{-1}
           = σ² [ x_i'M[1]x_i ( 1 − x_i'P[M[1]X̃_{-i}]x_i / x_i'M[1]x_i ) ]^{-1}
           = σ² [ x_i'M[1]x_i (1 − R²_i) ]^{-1},

where

R²_i ≡ x_i'P[M[1]X̃_{-i}]x_i / x_i'M[1]x_i.

Given (3.8.5), R²_i is the centered R-squared from the regression of x_i on X_{-i} (or, equivalently, the uncentered R-squared from the regression of M[1]x_i on M[1]X̃_{-i}).
Multicollinearity is perfect when x_i belongs to R(X̃_{-i}). In this case R²_i = 1 (see Section 3.7) and the variance of b_i diverges to infinity. Coefficient β_i cannot be estimated with the available data (X is not of f.c.r. in this case).
Remark 32. Multicollinearity, when it does not degenerate into perfect multicollinearity (i.e. det(X'X) = 0), does not affect the finite sample properties of OLS. Nonetheless, it may severely reduce the precision of our estimates, in terms of larger standard errors and confidence intervals.
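In Stata, the multicollinearity component 1/(1 − R²_i) of equation (4.8.3) can be inspected after a regression through the variance inflation factors. A minimal sketch, with hypothetical variable names:

. regress y x1 x2 x3
. estat vif                 // reports 1/(1 - R2_i) for each regressor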
Exercise 33. Partition X as X = (X̃ 1) and, accordingly, the OLS (k×1) vector as b = (b̃', b_0)', where b̃ is of dimension (k − 1)×1 and b_0 is the OLS estimator of the constant term β_0. Prove that

b_0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1)×1 vector of sample means of the X̃ regressors (hint: just use the intermediate result from the proof of the FWL Theorem that b1 = (X1'X1)^{-1}X1'(y − X2b2)).
Exercise 34. Use Exercise 33 and the following three facts:

1) Var(ȳ|X) = E{ [ȳ − E(ȳ|X)]² | X } = E(ε̄²|X) = σ²/n,

2) Var(x̄'b̃|X) = x̄' Var(b̃|X) x̄

and

3) Cov(ȳ, x̄'b̃|X) = E{ (ȳ − E(ȳ|X)) ( x̄'b̃ − E(x̄'b̃|X) )' | X }
                  = E[ (1/n) 1'εε'M[1]X̃ (X̃'M[1]X̃)^{-1} x̄ | X ]
                  = (σ²/n) 1'M[1]X̃ (X̃'M[1]X̃)^{-1} x̄ = 0,

to prove that

Var(b_0|X) = σ²/n + x̄' Var(b̃|X) x̄.
4.9. Residuals from partitioned OLS regressions
Consider the partitioned OLS regression of M[X2]y on the columns of M[X2]X1 as regressors and the corresponding OLS residual vector

u = M[X2]y − M[X2]X1 b1.

I now prove that u is equal to the OLS residual vector e ≡ M[X]y. Since b1 = (X1'M[X2]X1)^{-1}X1'M[X2]y, replacing it into the right hand side of the u equation yields

u = M[X2]y − M[X2]X1 (X1'M[X2]X1)^{-1}X1'M[X2]y = M[X2]y − P[M[X2]X1]y.

By Lemma 12, P[X] = P[X2] + P[M[X2]X1] and so M[X] = M[X2] − P[M[X2]X1], or M[X2] = M[X] + P[M[X2]X1]. Then

u = ( M[X] + P[M[X2]X1] ) y − P[M[X2]X1] y = M[X]y ≡ e.
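A quick numerical check of this result in Stata; y, x1 and x2 are hypothetical variable names and x2 plays the role of the partialled-out block X2.

. regress y x2
. predict ty, residuals            // M[X2]y
. regress x1 x2
. predict tx1, residuals           // M[X2]X1
. regress ty tx1
. predict u, residuals             // residuals from the partitioned regression
. regress y x1 x2
. predict e, residuals             // residuals from the full regression
. summarize u e                    // the two residual series coincide up to numerical rounding, as proved above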
CHAPTER 5
The Oaxaca’s model: OLS, optimal weighted least squares and
group-wise heteroskedasticity
5.1. Introduction
The Oaxaca’s model is a good way to check your comprehension of things so far. The
treatment is more complete than Greene (2008)’s. Importantly, it serves as a motivation of the
Zyskind’s condition, introduced in Section 5.4. It may also serve as an introduction to a number
of topics that will be covered later on: in particular, dummy variables; heteroskedasticity;
generalized least squares estimation; panel data models. Do Exercise 39 at the end.
5.2. Embedding the Oaxaca’s model into a pooled regression framework
We have 2 statistically independent samples, not necessarily of equal sizes: 1) a sample from the population of male workers, with observations on log(wage) collected into the (n_m×1) column vector y_m and socio-demographic explanatory variables collected into the (n_m×k) sample regressor matrix X_m; 2) a sample from the population of female workers, with the same variables collected into the (n_f×1) column vector y_f and the (n_f×k) matrix X_f, respectively.
Assume that both population models

y_m = X_m β_m + ε_m
y_f = X_f β_f + ε_f

meet LRM.1-LRM.5 with regression variances, σ²_m = σ²ω²_m and σ²_f = σ²ω²_f, not necessarily equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two
separate regressions, b_m and b_f, are independently normally distributed, with b_m|X_m ~ N[ β_m, σ²_m(X_m'X_m)^{-1} ] and b_f|X_f ~ N[ β_f, σ²_f(X_f'X_f)^{-1} ], and are both BLUE.
The question I ask is whether it is possible to embed the two models into a single regression model by pooling the two sub-samples into a larger one of size n = n_m + n_f, and continue to estimate β_m and β_f efficiently.
Let’s try and see. Here is the pooled data-set
y =
0
@
ym
yf
1
A , Xw =
0
@
Xm
Xf
1
A , " =
0
@
"m
"f
1
A .
Let 1 denote the (n×1) vector of all unity elements and construct the (n×1) vector d, such that its first n_m entries are all unity and the last n_f are all zero.
Variables like d are usually referred to as dummy variables or indicator variables, since they indicate whether an observation in the sample belongs or not to a given group. In this particular case, d is the male dummy variable indicating whether an observation in the sample is specific to the male group. Since the two groups are mutually exclusive, the female dummy variable, indicating whether an observation in the sample belongs to the female group, can be constructed as the complementary vector 1 − d. By construction, d and 1 − d are orthogonal, that is d'(1 − d) = 0.
Let x_{wi}' be the (1×k) row vector indicating the i-th row of X_w and let y_i, ε_i and d_i be scalars indicating the i-th components of y, ε and d, respectively.
With this in hand, the model for the generic worker i = 1, ..., n is

(5.2.1)   y_i = d_i x_{wi}'β_m + (1 − d_i) x_{wi}'β_f + ε_i.
On setting up the (2k×1) parameter vector β as

β = ( β_m ; β_f )

and the (n×2k) regressor matrix X as

X = ( X_m  0_{(n_m×k)} ; 0_{(n_f×k)}  X_f ),

where 0_{(s×t)} indicates an (s×t) matrix of all zero elements, model (5.2.1) can be reformulated in matrix form as

(5.2.2)   y = Xβ + ε.
Exercise 35. Prove that X has f.c.r. if and only if both Xm and Xf have f.c.r.
Summing up, we have two equivalent representations of the same model: 1) that in Greene (2008), with the two separate regressions; 2) that presented here, with the single regression model (5.2.2). It turns out that the two frameworks are equivalent, as far as efficient estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS estimator b from model (5.2.2) is numerically identical to the OLS estimators from the two separate regressions as presented in Greene (2008), i.e. b = (b_m', b_f')'. Let

b = (X'X)^{-1}X'y.
By construction,

X'y = ( X_m'y_m ; X_f'y_f )

and

X'X = ( X_m'X_m  0_{(k×k)} ; 0_{(k×k)}  X_f'X_f ).

Then, by a well-known property of the inverse of a block-diagonal matrix (see (A-73) in Greene (2008)),

(X'X)^{-1} = ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ).

Hence,

b = ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ) ( X_m'y_m ; X_f'y_f )
  = ( (X_m'X_m)^{-1}X_m'y_m ; (X_f'X_f)^{-1}X_f'y_f )
  = ( b_m ; b_f ).
Exercise 36. Prove that b = (b_m', b_f')' using the FWL Theorem.
It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances ε, although independently distributed, suffer from what is usually referred to as group-wise heteroskedasticity, as the model does not maintain σ²_m = σ²_f. Indeed, the covariance matrix of ε is

Σ = σ² ( ω²_m I_{n_m}  0_{(n_m×n_f)} ; 0_{(n_f×n_m)}  ω²_f I_{n_f} ).

In this sense, model (5.2.2) is not a classical regression model. Does this mean that b is not BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and the Oaxaca's model is one such case. This is verified in the next section. A general necessary and sufficient condition for the OLS estimator to be BLUE is postponed to the last section of this chapter.
5.3. The OLS estimator in the Oaxaca's model is BLUE
Model (5.2.2) can be transformed into a classical regression model by using a standard procedure in econometrics and statistics: weighting. Let

H = ( ω_m^{-1} I_{n_m}  0_{(n_m×n_f)} ; 0_{(n_f×n_m)}  ω_f^{-1} I_{n_f} ).

As stated in the exercise below, the matrix H, when premultiplied to any conformable vector, transforms the vector so that its first n_m elements are divided by ω_m and the remaining n_f by ω_f. This is what we refer to as "weighting".
Exercise 37. Verify by direct inspection that, given any (n_m×1) vector x_m, any (n_f×1) vector x_f and x = (x_m', x_f')', then

Hx = ( ω_m^{-1}x_m ; ω_f^{-1}x_f ).
Premultiply both sides of model (5.2.2) by H:

Hy = HXβ + Hε,

or

(5.3.1)   ỹ = X̃β + ε̃,

where the tilde indicates weighted variables. Two important facts are worth observing at this point. First, the population parameter vector β in the weighted model is the same as in model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to σ²I_n (so, if LRM.5 holds, they are independent normal errors with common variance σ²), since
Var(ε̃|X̃) = HΣH' = HΣH
 = ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} ) ( σ²_m I_{n_m}  0 ; 0  σ²_f I_{n_f} ) ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} )
 = σ² ( ω_m I_{n_m}  0 ; 0  ω_f I_{n_f} ) ( ω_m^{-1}I_{n_m}  0 ; 0  ω_f^{-1}I_{n_f} )
 = σ² I_n.
Therefore, the weighted model is a classical regression model that identifies the parameters of interest and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted model (5.3.1), referred to as the weighted least squares (WLS) estimator, b_w, is BLUE for β.
Let us work out its formula, using Exercise 37:

b_w = ( (ω_m^{-2}X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (ω_f^{-2}X_f'X_f)^{-1} ) ( ω_m^{-2}X_m'y_m ; ω_f^{-2}X_f'y_f )
    = ( ω²_m(X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  ω²_f(X_f'X_f)^{-1} ) ( ω_m^{-2}X_m'y_m ; ω_f^{-2}X_f'y_f )
    = ( (X_m'X_m)^{-1}X_m'y_m ; (X_f'X_f)^{-1}X_f'y_f ),

which proves that b = b_w, namely that in the Oaxaca's model the OLS estimator coincides with the optimal WLS estimator.
Does this imply that we can do inference in the Oaxaca's model by feeding the Stata regress command with the variables of model (5.2.2) without further caution? Not quite. Although the single OLS regression provides the BLUE estimator of the population coefficients β, the OLS estimate of Var(b|X) that would be computed by regress,

Ṽar(b|X) = s² ( (X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  (X_f'X_f)^{-1} ),

with s² obtained from the sum of squares of the pooled residuals, is biased. The reason is that Ṽar(b|X) forces the regression variance estimate to be constant across the two samples. Luckily, the same is not true for the separate regressions on the two subsamples, which provide us with the unbiased estimators of the model coefficients, b_m and b_f, and the unbiased estimator of the covariance matrix

V̂ar(b|X) = ( s²_m(X_m'X_m)^{-1}  0_{(k×k)} ; 0_{(k×k)}  s²_f(X_f'X_f)^{-1} ),

where s²_m = (1/(n_m − k)) Σ_{i=1}^{n_m} e²_i and s²_f = (1/(n_f − k)) Σ_{i=n_m+1}^{n} e²_i. Alternatively, one can implement a feasible version of the weighted regression explained above, using s_m and s_f as weights. But this is clearly more computationally cumbersome than carrying out the two separate regressions.
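A hedged Stata sketch of the two equivalent estimation routes; the variable names (lwage, educ, exper, male) are hypothetical. The pooled fully-interacted regression reproduces b_m and b_f in one run but, as argued above, its reported standard errors impose a common error variance; the separate regressions deliver both the coefficients and valid group-specific standard errors.

. regress lwage educ exper if male == 1            // b_m and s2_m
. regress lwage educ exper if male == 0            // b_f and s2_f
* pooled regression reproducing (b_m, b_f); its reported standard errors are biased here
. generate female  = 1 - male
. generate m_educ  = male*educ
. generate m_exper = male*exper
. generate f_educ  = female*educ
. generate f_exper = female*exper
. regress lwage male m_educ m_exper female f_educ f_exper, noconstant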
5.4. A general result
Zyskind (1967) provides a general necessary and sufficient condition for the OLS estimator to be BLUE.
Theorem 38. Given the regressor matrix X and the conditional covariance matrix Var(ε|X) = Σ, the OLS estimator b = (X'X)^{-1}X'y is BLUE if and only if P[X]Σ = ΣP[X].
If LRM.4 holds, that is Σ = σ²I_n, Zyskind's condition holds for any X, since

P[X]Σ = P[X]I_nσ² = σ²P[X] = ΣP[X].
That Zyskind’s condition is also verified in the Oaxaca’s model is straightforwardly proved
upon elaborating P[X] :
P[X] = X�
X 0X��1
X 0
=
0
@
Xm 0(nm
⇥k)
0
(
nf
⇥k)
Xf
1
A
0
@
(X 0mXm)
�10(k⇥k)
0(k⇥k)
⇣
X 0fXf
⌘�1
1
A
0
@
X 0m 0
(
k⇥nf
)
0(k⇥nm
) X 0f
1
A
=
0
@
Xm (X 0mXm)
�10(n
m
⇥k)
0
(
nf
⇥k)
Xf
⇣
X 0fXf
⌘�1
1
A
0
@
X 0m 0
(
k⇥nf
)
0(k⇥nm
) X 0f
1
A
=
0
B
@
Xm (X 0mXm)
�1X 0m 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
Xf
⇣
X 0fXf
⌘�1X 0
f
1
C
A
=
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
Therefore,
⌃P[X] =
0
B
@
�2mIn
m
0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fInf
1
C
A
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
=
0
B
@
�2mP[X
m
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fP
[
Xf
]
1
C
A
=
0
B
@
P[Xm
] 0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
P[
Xf
]
1
C
A
0
B
@
�2mIn
m
0
(
nm
⇥nf
)
0
(
nf
⇥nm
)
�2fInf
1
C
A
.
= P[X]⌃.
As a final remark, note that the Zyskind condition ensures only that the OLS coefficients are BLUE, saying nothing about the properties of the OLS standard error estimates; indeed, we have seen in the previous sections that the latter may be biased even if b is BLUE. The following exercise on partitioning provides another instance of such an occurrence.
Exercise 39. Consider the partitioned regression

(5.4.1)   y = X1β1 + X2β2 + ε,

maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation by M[X2] boils down to the reduced regression model

(5.4.2)   ỹ = X̃1β1 + ε̃,

where ỹ = M[X2]y, X̃1 = M[X2]X1 and ε̃ = M[X2]ε.
2) How can you interpret the variables in model (5.4.2)? 3) As far as β1 is concerned, does OLS applied to model (5.4.2) yield the same estimator as OLS applied to model (5.4.1)? Why or why not? 4) Does the reduced model (5.4.2) satisfy LRM.1-LRM.4? Which ones, if any, are not satisfied? 5) The degrees of freedom of the reduced regression are n − k1. Do you think that the resulting OLS estimate of σ² would be unbiased? 6) Verify that the reduced model (5.4.2) satisfies the Zyskind condition.
Solution. 1) It does, since M[X2]X2 = 0. 2) The variables ỹ and X̃1 in model (5.4.2) are the residuals from k1 + 1 separate regressions using y and each column of X1 as dependent variables and the columns of X2 as regressors. 3) Yes, by the FWL Theorem. 4) LRM.1-LRM.3 are met, but LRM.4 fails, with

Var(ε̃|X̃1) = σ²M[X2].

5) It is biased, since we know that the unbiased OLS estimator uses n − k degrees of freedom to correct the OLS residual sum of squares (which is nonetheless the same for both models (5.4.1) and (5.4.2), as we learn from Section 4.9). 6) You just have to verify that

M[X2]P[M[X2]X1] = P[M[X2]X1]M[X2],

which is readily done by noting that M[X2], symmetric and idempotent, is the first and the last factor in P[M[X2]X1].
The within regression examined in Chapter 8 (equation (8.2.7)) is a special case of model
(5.4.2) in the foregoing exercise.
CHAPTER 6
Tests for structural change
6.1. The Chow predictive test
There is a time series data-set with T = T1 + T2 observations. X denotes the (T×k) regressor matrix of full column rank, y is the (T×1) vector of observations for the dependent variable and ε is the error vector, which is normally distributed, given X, with zero mean and constant variance σ². Assume also that T1 > k. It is worth noting right now that we are not assuming T2 > k, so that the test we introduce here, differently from the classical Chow test, goes through also in the case in which the second subsample does not permit identification of the (k×1) vector of population coefficients β.
Partition y, X and ε row-wise:

y = ( y1 ; y2 ),   X = ( X1 ; X2 )   and   ε = ( ε1 ; ε2 ),

so that the top blocks of the partitioned matrices contain the first T1 observations and the bottom blocks contain the last T2 observations. The model under the null hypothesis is just the classical regression model

(6.1.1)   y = Xβ + ε,

and the OLS estimator of β is

b* = (X'X)^{-1}X'y.
The structural break is thought of as time-specific additive shocks hitting the model over the second subsample. Therefore, the model in the presence of the structural break is formulated as

y1 = X1β + ε1
(6.1.2)   y2 = X2β + I_{T2}δ + ε2,

where I_{T2} is the identity matrix of order T2 and δ is the (T2×1) vector of time-specific shocks. Written in more compact form, where 0_{(l×m)} indicates an (l×m) null matrix, the general model becomes

( y1 ; y2 ) = ( X1  0_{(T1×T2)} ; X2  I_{T2} ) ( β ; δ ) + ( ε1 ; ε2 ),
from which it is clear that model (6.1.2) uses a larger regressor matrix than model (6.1.1), including also the time dummies specific to the observations in the second subsample,

D = ( 0_{(T1×T2)} ; I_{T2} ),

and that the time-specific shocks δ are the coefficients on those time dummies. The OLS estimators of β and δ can then be expressed as

( b ; c ) = (W'W)^{-1}W'y,

where W = (X D) is the extended regressor matrix. Therefore, the null hypothesis can be formally expressed as H0: δ = 0_{(T2×1)} and can be tested by a standard F-test of joint significance:

(6.1.3)   F = [ (e*'e* − e'e)/T2 ] / [ e'e/(T1 − k) ] ~ F(T2, T1 − k),
where e* = y − Xb* indicates the OLS residual (T×1) vector from the model under the null and

(6.1.4)   e = y − W ( b ; c )

is the OLS residual (T×1) vector from the unrestricted model¹.
It is possible to prove that the F-ratio in (6.1.3) equals

(6.1.5)   F = [ (e*'e* − e1'e1)/T2 ] / [ e1'e1/(T1 − k) ],

where e1 = y1 − X1b1 denotes the OLS residual (T1×1) vector obtained by regressing y1 on X1 and, accordingly,

(6.1.6)   b1 = (X1'X1)^{-1}X1'y1.
Differently from Greene (2008), I prove this result by using the FWL Theorem. We just need to work out e, and to do this I obtain the separate expressions for b and c. Let's get started with b. By the FWL Theorem,

b = (X'M[D]X)^{-1}X'M[D]y.

Expand M[D] to obtain

M[D] = I_T − ( 0_{(T1×T2)} ; I_{T2} ) [ ( 0_{(T1×T2)} ; I_{T2} )'( 0_{(T1×T2)} ; I_{T2} ) ]^{-1} ( 0_{(T1×T2)} ; I_{T2} )'.
The matrix in brackets in the foregoing expression reduces to

0_{(T2×T2)} + I_{T2} = I_{T2}.

¹The degrees-of-freedom correction in the denominator of the F ratio follows from the fact that the number of estimated parameters under the alternative is k + T2.
Therefore,

(6.1.7)   M[D] = I_T − ( 0_{(T1×T2)} ; I_{T2} )( 0_{(T1×T2)} ; I_{T2} )' = I_T − ( 0_{(T1×T1)}  0_{(T1×T2)} ; 0_{(T2×T1)}  I_{T2} ).

The second matrix in equation (6.1.7) transforms any conformable vector it premultiplies so that its first T1 values are replaced by zeroes and its last T2 values are left unchanged. Accordingly, M[D] carries out the complementary operation, transforming any conformable vector to which it is premultiplied in such a way that its first T1 values remain unchanged and its last T2 values are replaced by zeroes. Therefore,

M[D]X = ( X1 ; 0_{(T2×k)} ),

implying that b = b1, as defined in equation (6.1.6).
Turning to c, by the FWL Theorem (equation (3.6.3)),

c = (D'D)^{-1}D'(y − Xb)

and since b = b1 and D'D = I_{T2},

(6.1.8)   c = ( 0_{(T1×T2)} ; I_{T2} )'(y − Xb1) = ( 0_{(T1×T2)} ; I_{T2} )'( y1 − X1b1 ; y2 − X2b1 ) = y2 − X2b1.
Therefore, the OLS coefficients c are the prediction errors made for the second subsample when using the estimates b1 obtained from the first subsample. Finally, replacing the right hand sides of equations (6.1.6) and (6.1.8) into equation (6.1.4) yields

e = y − Xb1 − D(y2 − X2b1) = ( y1 ; y2 ) − ( X1b1 ; X2b1 ) − ( 0_{(T1×1)} ; y2 − X2b1 ) = ( e1 ; 0_{(T2×1)} ),

which proves that e'e = e1'e1 and, consequently, the F test expression of equation (6.1.5).
Remark 40. Since nothing in the foregoing derivation hinges upon the fact that the T2 observations are contiguous in the sample (the OLS estimator and residuals are invariant to permutations of the rows), there is a more general lesson to be learnt here. Regardless of the data-set format, be it a time series, a cross-section or a panel, extending the matrix of regressors with dummy variables that each indicate a single observation will effectively exclude all the involved observations from the estimation sample. Therefore, the above procedure can be used both to test that given observations in the sample are not outliers and, in case of rejection, to exclude the outliers from the estimation sample, without materially removing records from the data-set.
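A hedged Stata sketch of this observation-dummy device; the flagged observation numbers (55 and 56) and the variable names are arbitrary illustrations.

. generate byte d55 = (_n == 55)
. generate byte d56 = (_n == 56)
. regress y x1 x2 d55 d56          // observations 55 and 56 are effectively excluded from the fit
. test d55 d56                     // joint test that the two observations are not outliers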
6.2. An equivalent reformulation of the Chow predictive test
Here we stress the interpretation of the Chow predictive test as a test of zero prediction
errors in the period with insufficient observations. In doing this, we reformulate the test using
the formula based on the Wald statistic.
From equation (6.1.8) derived in the previous section we have
c = y2 − X2b1.
Therefore c, the OLS estimator of δ, can be thought of as the prediction error incurred when we use the first sub-period estimates to predict y in the second sub-period, given X2. If the elements of c are not jointly significantly different from zero, then we have evidence for not rejecting the null hypothesis of parameter constancy (zero idiosyncratic shocks δ). The Chow predictive test in (6.1.5),

F = [ (e*'e* − e1'e1)/T2 ] / [ e1'e1/(T1 − k) ],

assesses exactly this in a rigorous way, since F ~ F(T2, T1 − k) under the null hypothesis and normal ε.
As usual, F can be rewritten as the Wald statistic divided by the number of restrictions (T2 in this case):

(6.2.1)   F = c'[ V̂ar(c|X) ]^{-1} c / T2

(c is the discrepancy vector and V̂ar(c|X) is the OLS estimator of Var(c|X)). The F formula in (6.2.1) can be made operational by elaborating the conditional covariance matrix of the prediction error, Var(c|X), under the null hypothesis of the test, H0: δ = 0.
If δ = 0, then y2 = X2β + ε2 and b1 = β + (X1'X1)^{-1}X1'ε1, and hence

c = X2β + ε2 − X2[ β + (X1'X1)^{-1}X1'ε1 ] = ε2 − X2(X1'X1)^{-1}X1'ε1.

Therefore, under H0,

Var(c|X) = E{ [ε2 − X2(X1'X1)^{-1}X1'ε1][ε2 − X2(X1'X1)^{-1}X1'ε1]' | X }
         = E(ε2ε2'|X) + E[ X2(X1'X1)^{-1}X1'ε1ε1'X1(X1'X1)^{-1}X2' | X ]
         = σ²[ I_{T2} + X2(X1'X1)^{-1}X2' ],
where the second equality follows from the assumption of spherical disturbances over the whole sample,

Var[ ( ε1 ; ε2 ) | X ] = σ²I_T.

Hence,

V̂ar(c|X) = s²_1 [ I_{T2} + X2(X1'X1)^{-1}X2' ],

where s²_1 = e1'e1/(T1 − k). This implies that the predictive F test can also be computed as

F = (y2 − X2b1)'[ I_{T2} + X2(X1'X1)^{-1}X2' ]^{-1}(y2 − X2b1) / (T2 s²_1).
6.3. The classical Chow test
Now both T1 > k and T2 > k, so that there is enough information in both subsamples to identify two subsample-specific (k×1) beta vectors, β1 and β2, and to base a parameter-constancy test on H0: β1 = β2.
As before, X denotes the (T×k) regressor matrix of full column rank, y is the (T×1) vector of observations for the dependent variable and ε is the error vector, which is normally distributed, given X, with zero mean and covariance matrix σ²I_T. The usual row-wise partitioning holds:

y = ( y1 ; y2 ),   X = ( X1 ; X2 )   and   ε = ( ε1 ; ε2 ),

so that the top blocks of the partitioned matrices contain the first T1 observations and the bottom blocks contain the last T2 observations. Assume that both X1 and X2 have f.c.r. Notice that this is not ensured by f.c.r. of X, as the following example shows.
Example 41. The matrix

X = ( 1 2 ; 1 2 ; 1 2 ; 1 0 ; 1 0 )

has rank 2, but

X1 = ( 1 2 ; 1 2 ; 1 2 )

has rank 1.
The general model for the Chow test is

y1 = X1β1 + ε1
y2 = X2β2 + ε2

or, more compactly,

(6.3.1)   y = W ( β1 ; β2 ) + ε,

where

W = ( X1  0_{T1×k} ; 0_{T2×k}  X2 )

and β1 and β2 are (k×1) vectors.
Let

D = ( 0_{T1×T2} ; I_{T2} ).
As demonstrated in Section 6.1, a generic (T×1) vector x, when premultiplied by M[D], is transformed into the interaction of x with the time dummy for the first sub-period and, when premultiplied by P[D], is transformed into the interaction of x with the time dummy for the second sub-period. Hence, the general model (6.3.1) can be equivalently written as

y = M[D]Xβ1 + P[D]Xβ2 + ε,

or, still equivalently, given M[D] = I − P[D],

(6.3.2)   y = Xβ1 + P[D]Xγ + ε,

where γ ≡ β2 − β1. Therefore, the null hypothesis of the Chow test, H0: β1 = β2, is equivalent to the exclusion restrictions H0: γ = 0, and so the test can be implemented by 1) constructing just one set of interacted variables P[D]X and 2) carrying out an F test of joint significance of the coefficients on the interacted variables after OLS estimation of model (6.3.2).
As it turns out, the classical Chow test is a special case of the predictive Chow test. Consider model (6.3.2) and reformulate it by expanding P[D]:

y = Xβ1 + D(D'D)^{-1}D'Xγ + ε

or

y = Xβ + Dδ + ε,

where β = β1 and δ = (D'D)^{-1}D'Xγ. But since (D'D)^{-1}D'X = X2, then δ = X2γ, which shows that δ ∈ R(X2) ⊆ R^{T2} and so that the δ's here are not arbitrary as in the predictive Chow test, where δ ∈ R^{T2}. This implies that, given rank(X2) = k, the two tests are identically equal if and only if T2 = k; only in this case, in fact, is R(X2) = R^{T2}.
CHAPTER 7
Large sample results for OLS and GLS estimators
7.1. Introduction
The linear regression model may present departures from LRM.4, such as heteroskedastic-
ity and/or cluster correlation. In this chapter we study common econometric techniques that
accommodate these issues, for both estimation and inference: primarily, the Generalized LS
(GLS) estimator for the regression coefficients and robust covariance estimators.
All the statistical properties are now derived for n → ∞, and so the techniques we consider in this chapter work well in "large samples".
I spell out the assumptions needed for consistency and asymptotic normality of the OLS and GLS estimators, providing the derivation of the large-sample properties.
Strict exogeneity is maintained throughout:
SE: E(ε|X) = 0.
A weaker version of the random sampling assumption, one which does not maintain identical distributions of records, is invoked when proving asymptotic normality and consistency of the variance estimators:
RS: There is a sample of size n, such that the elements of the sequence {(y_i x_i'), i = 1, ..., n} are independent (NB not necessarily identically distributed) random vectors.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).
7.2. OLS with non-spherical error covariance matrix
I prove consistency in the general case of Var(ε|X) = Σ, where Σ is an arbitrary and unknown symmetric, p.d. matrix.
7.2.1. Consistency. The following assumptions hold.
OLS.1: plim(X'ΣX/n) = lim_{n→∞} E(X'ΣX/n), a positive definite, finite matrix.
OLS.2: plim(X'X/n) = Q, a positive definite, finite matrix.
The proof of consistency goes as follows. We have

b = β + (X'X/n)^{-1}(X'ε/n);

then

plim(b) = β + plim[ (X'X/n)^{-1}(X'ε/n) ] = β + Q^{-1} plim(X'ε/n).

By strict exogeneity,

E(X'ε/n) = 0.

Moreover,

Var(X'ε/n | X) = (1/n) E(X'εε'X/n | X) = (1/n)(X'ΣX/n)

and so

Var(X'ε/n) = (1/n) E(X'ΣX/n),

which goes to zero as n → ∞ by assumption OLS.1. Hence X'ε/n converges in squared mean, and consequently in probability, to zero.
Clearly, the above implies that OLS is consistent in the classical case of LRM.4.
7.2.2. Asymptotic normality with heteroskedasticity. Assumptions OLS.1 and OLS.2 hold along with RS and the following:
OLS.3: Var(ε|X) = Σ, where

Σ = diag( σ²_1, σ²_2, ..., σ²_n ).
OLS.3 permits heteroskedasticity but not correlation. Partition X row-wise:

X = ( x1' ; x2' ; ... ; xn' ).
By strict exogeneity E(x_iε_i) = 0 and hence

Var(x_iε_i) = E[ E(ε²_i x_ix_i' | x_i) ] = σ²_i E(x_ix_i')

and

(1/n) Σ_{i=1}^{n} Var(x_iε_i) = (1/n) Σ_{i=1}^{n} σ²_i E(x_ix_i') = E(X'ΣX/n).
Therefore,

lim_{n→∞} (1/n) Σ_{i=1}^{n} Var(x_iε_i) = lim_{n→∞} E(X'ΣX/n),

which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,

(√n/n) Σ_{i=1}^{n} x_iε_i ≡ X'ε/√n →d N( 0, plim(X'ΣX/n) ).

Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),

√n (b − β) ≡ (X'X/n)^{-1} X'ε/√n →d Q^{-1} X'ε/√n,

and so

√n (b − β) →d N( 0, Q^{-1} plim(X'ΣX/n) Q^{-1} ).
7.2.3. White’s heteroskedasticity consistent estimator for the OLS standard
errors. Under OLS.3 and given the OLS residuals ei = yi�x
0ib, i = 1, ..., n, an heteroskedas-
ticity consistent estimator for the asymptotic covariance matrix of b,
Avar (b) =1
nQ�1plim
✓
X 0⌃X
n
◆
Q�1,
is given by the White’s estimator:
(7.2.1) \Avar (b) =�
X 0X��1
X 0ˆ
⌃X�
X 0X��1
,
where
ˆ
⌃ =
2
6
6
6
6
6
6
6
4
e21 0 · · · 0
0 e22. . . ...
... . . . . . .0
0 · · · 0 e2n
3
7
7
7
7
7
7
7
5
.
An equivalent way to express Σ̂, one that will be used intensively in Chapters 8 and 9, is the following:

Σ̂ = ee' * I_n,

where the symbol * stands for the element-by-element matrix product (also known as the Hadamard product).
Econometric software routinely computes robust OLS standard errors: these are just the square roots of the main diagonal elements of Âvar(b) in (7.2.1). In Stata this is done through the regress option vce(robust) (or, equivalently, simply robust).
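A minimal Stata sketch with hypothetical variable names, contrasting the two covariance estimators:

. regress y x1 x2                  // conventional standard errors, based on s2(X'X)^{-1}
. regress y x1 x2, vce(robust)     // White heteroskedasticity-robust standard errors from (7.2.1)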
7.2.4. White's heteroskedasticity test. The White estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical discrepancy between $s^2(X'X)^{-1}$ and $(X'X)^{-1}X'\hat\Sigma X(X'X)^{-1}$. Under the null hypothesis of homoskedasticity, the discrepancy will be small. This is the essence of White's heteroskedasticity test. The statistic measuring this discrepancy can be computed through the following auxiliary regression, which includes the constant term.

(1) Generate the squared OLS residuals, $e^2 = e \ast e$.
(2) Run the OLS auxiliary regression that uses $e^2$ as the dependent variable and the following regressors: all $k$ variables in the $n\times k$ sample matrix $X = (X_1\ 1)$ and all interaction variables and squared variables in $X_1$. This implies that for any two columns of $X_1$, say variables $x_i$ and $x_j$, there are the additional regressors $x_i \ast x_j$, $x_i \ast x_i$ and $x_j \ast x_j$. The auxiliary regression, therefore, has $p \equiv k(k+1)/2$ regressors.
(3) Save the R-squared of the auxiliary regression, say $R^2_a$, and multiply it by the sample size $n$. The resulting statistic $nR^2_a \sim_A \chi^2(p-1)$ measures the statistical discrepancy between the two covariance estimators and so provides a convenient heteroskedasticity test: reject homoskedasticity when $nR^2_a$ is larger than a conventional percentile of choice for $\chi^2(p-1)$.

We may implement the White test manually, saving the OLS residuals through predict and then generating squares and interactions as appropriate, or more easily by giving the following post-estimation command after regress: imtest, white.
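As an illustration, here is a hedged manual sketch for a regression with two hypothetical regressors x1 and x2 (so $k = 3$ including the constant, $p = 6$ and $p-1 = 5$ degrees of freedom):

regress y x1 x2
predict e, residuals
gen e2   = e^2
gen x1sq = x1^2
gen x2sq = x2^2
gen x1x2 = x1*x2
regress e2 x1 x2 x1sq x2sq x1x2
display "nR2 = " e(N)*e(r2) "   p-value = " chi2tail(5, e(N)*e(r2))

In more recent Stata releases the built-in version of the test is also available as the post-estimation command estat imtest, white.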
7.2.5. Clustering. Clustering of observations along a given dimension is the norm in
microeconometric applications. For example, firms cluster across different sectors, households
live in different provinces, immigrants in a given country belong to different ethnic groups and
so on.
Clustering cannot be neglected in empirical work. In the case of firm data, for example,
it is likely that there is correlation across the productivity shocks hitting firms in the same
sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.
The White estimator can be made robust to cluster correlation quite easily. I explain
this in terms of the firm data example. Assume that we have cross-sectional data of n firms,
indexed by i = 1, ..., n. There are G sectors, indexed by g = 1, ..., G and we know which sector
$g = 1,\dots,G$ firm $i = 1,\dots,n$ belongs to. This information is contained in the $n\times G$ matrix $D$ of sectoral indicators: the element of $D$ in row $i$ and column $j$, say $d(i,j)$, is unity if firm $i$ belongs to sector $j$ and zero if not. The cluster-correlation and heteroskedasticity consistent estimator for the asymptotic covariance matrix of $b$ is then assembled by simply replacing $\hat\Sigma$ in Equation (7.2.1) with
\[
\hat\Sigma_c = ee' \ast DD'.
\]
Stata does this through the regress option vce(cluster clustervar), where clustervar is
the name of the cluster identifier in the data set.
Chapter 9 will cover cases of multi-clustering, that is data that are grouped along more
than one dimension.
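For example, with a hypothetical firm-level data set containing a sector identifier, the cluster-robust standard errors are requested as follows:

regress y x1 x2, vce(cluster sector)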
7.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of the average variance
\[
\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2
\]
is given by
\[
s_n^2 = \frac{1}{n}\sum_{i=1}^n e_i^2,
\]
in the sense that $\operatorname{plim}\left(s_n^2 - \sigma_n^2\right) = 0$ (NB I use this formulation and not $\operatorname{plim}\left(s_n^2\right) = \sigma_n^2$, as $\sigma_n^2$ is a sequence).

Since
\[
s_n^2 = \frac{\varepsilon'\varepsilon}{n} - \left(\frac{\varepsilon'X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right),
\]
\[
\operatorname{plim}\left(s_n^2\right) = \operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right) + 0'Q^{-1}0 = \operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}\right).
\]
By the RS assumption the squared errors, $\varepsilon_i^2$, are all independently distributed with means $E\left(\varepsilon_i^2\right) \equiv \sigma_i^2$, and given that
\[
\frac{\varepsilon'\varepsilon}{n} = \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2,
\]
I can apply Markov's strong law of large numbers to obtain
\[
\operatorname{plim}\left[\left(\frac{\varepsilon'\varepsilon}{n}\right) - \frac{1}{n}\sum_{i=1}^n \sigma_i^2\right] = 0.
\]
7.3. GLS
The estimation strategy described in the previous sections is based on OLS estimates of the regression coefficients with standard error estimates corrected for heteroskedasticity and/or cluster correlation. The drawback of this approach is a loss in efficiency when the departures from LRM.4 are of a known form. We will see that in this case the BLUE can always be found.

To formalize the new set-up, let $\mathrm{Var}(\varepsilon|X) \equiv \Sigma = \sigma^2\Omega$, where $\Omega$ is a known symmetric, positive definite (p.d.) $(n\times n)$ matrix and $\sigma^2$ is an unknown strictly positive scalar (that is, $\Sigma$ is known up to a strictly positive multiplicative scalar).

Since $\Omega$ is symmetric and p.d., it can always be factorized as $\Omega = C\Lambda C'$, where $\Lambda$ is an $(n\times n)$ diagonal matrix with main-diagonal elements all strictly positive and $C$ is an $(n\times n)$ matrix such that $C'C = I$, implying that $C'$ is the inverse of $C$ and consequently that $CC' = I$ ($\Lambda$ and $C$ are called, respectively, the eigenvalue (or characteristic root) and eigenvector (or characteristic vector) matrices of $\Omega$).

A great benefit of the foregoing factorization is that it permits us to compute the inverse of $\Omega$ effortlessly. In fact, it is possible to verify that
\[
\Omega^{-1} = C\Lambda^{-1}C' \tag{7.3.1}
\]
and
\[
\Omega^{-1/2} = C\Lambda^{-1/2}C', \tag{7.3.2}
\]
where $\Omega^{-1}$ is the inverse of $\Omega$, $\Omega^{-1/2}\Omega^{-1/2} = \Omega^{-1}$, $\Lambda^{-1}$ is the inverse of $\Lambda$ and $\Lambda^{-1/2}$ is a diagonal matrix with main-diagonal elements equal to the reciprocals of the square roots of the main-diagonal elements of $\Lambda$.

Consider the GLS transformed model
\[
y^* = X^*\beta + \varepsilon^*, \tag{7.3.3}
\]
such that $y^* \equiv \Lambda^{-1/2}C'y$, $X^* \equiv \Lambda^{-1/2}C'X$ and $\varepsilon^* \equiv \Lambda^{-1/2}C'\varepsilon$.

Exercise 42. Verify by direct inspection that indeed $\Omega^{-1}\Omega = \Omega\Omega^{-1} = I$ and $\Omega^{-1/2}\Omega^{-1/2} = \Omega^{-1}$.

Solution. Remember that $\Lambda$ is diagonal and so $\Lambda^{-1}\Lambda = \Lambda\Lambda^{-1} = I$. Then,
\[
\Omega^{-1}\Omega = C\Lambda^{-1}C'C\Lambda C' = C\Lambda^{-1}I\Lambda C' = C\Lambda^{-1}\Lambda C' = CC' = I
\]
and
\[
\Omega\Omega^{-1} = C\Lambda C'C\Lambda^{-1}C' = C\Lambda I\Lambda^{-1}C' = C\Lambda\Lambda^{-1}C' = CC' = I.
\]
The rest is proved similarly on considering that $\Lambda^{-1/2}$ is diagonal and so $\Lambda^{-1/2}\Lambda^{-1/2} = \Lambda^{-1}$.

Exercise 43. Use (7.3.1) and (7.3.2) to prove 1) $X^{*\prime}X^* = X'\Omega^{-1}X$; 2) $X^{*\prime}\varepsilon^* = X'\Omega^{-1}\varepsilon$; and 3) $\mathrm{Var}(\varepsilon^*|X) = \sigma^2 I_n$; then use the general law of iterated expectation to prove that also $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$.

Given the results of the foregoing exercise, OLS applied to the transformed model (7.3.3) is the Gauss-Markov estimator for $\beta$ and has the formula
\[
b_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^* = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y \tag{7.3.4}
\]
with $\mathrm{Var}(b_{GLS}|X) = \sigma^2\left(X'\Omega^{-1}X\right)^{-1}$.
Exercise 44. Let $\Omega^{-1/2} = C\Lambda^{-1/2}C'$ and prove that
\[
\Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}\varepsilon \tag{7.3.5}
\]
is also a GLS transformation, that is, OLS applied to model (7.3.5) yields $b_{GLS}$.

Solution: By Exercise 42,
\[
\left(X'\Omega^{-1/2}\Omega^{-1/2}X\right)^{-1}X'\Omega^{-1/2}\Omega^{-1/2}y = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y.
\]
The estimator $b_{GLS}$ is OLS applied to a classical regression model and as such it is BLUE. The following exercise asks you to verify by direct inspection that GLS is "better than" OLS in terms of covariance.

Exercise 45. Prove that
\[
\sigma^2\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1} - \sigma^2\left(X'\Omega^{-1}X\right)^{-1}
\]
is a n.n.d. matrix.

Solution: We define a $k\times n$ matrix $D$ as
\[
D \equiv \left(X'X\right)^{-1}X' - \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}.
\]
Therefore,
\[
\left(X'X\right)^{-1}X' = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1} + D.
\]
On noticing that $DX = 0_{k\times k}$,
\[
\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1}
= \left[\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1} + D\right]\Omega\left[\Omega^{-1}X\left(X'\Omega^{-1}X\right)^{-1} + D'\right]
= \left(X'\Omega^{-1}X\right)^{-1} + D\Omega D'.
\]
Since $\Omega$ is p.d., for any $n\times 1$ vector $z$, $z'\Omega z \geq 0$, being equal to zero if and only if $z = 0$. But then, $z'\Omega z \geq 0$ when, in particular, $z = D'w$ for any $n\times 1$ vector $w$, which is equivalent to saying that $w'D\Omega D'w \geq 0$ for any $n\times 1$ vector $w$, or that $D\Omega D'$ is n.n.d., proving the result.
7.3.1. Consistency of GLS. The following assumption holds.

GLS.1: $\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = \lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right) = Q$, a positive definite, finite matrix.

Exercise 46. Given that
\[
b_{GLS} = \beta + \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{n},
\]
prove that $\operatorname{plim}(b_{GLS}) = \beta$ under assumption GLS.1 and strict exogeneity (SE).

Solution. Easy: just write
\[
b_{GLS} = \beta + \left(\frac{X^{*\prime}X^*}{n}\right)^{-1}\frac{X^{*\prime}\varepsilon^*}{n},
\]
then consider that $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$ (see Exercise 43) and, finally, follow the same steps as in Section 7.2.1.
7.3.2. Asymptotic normality. I prove asymptotic normality for $b_{GLS}$ under GLS.1, SE and RS (again, remember that $\mathrm{Var}(\varepsilon^*|X^*) = \sigma^2 I_n$).

By strict exogeneity $E(x_i^*\varepsilon_i^*) = 0$ and hence
\[
\mathrm{Var}(x_i^*\varepsilon_i^*) = \sigma^2 E\left(x_i^* x_i^{*\prime}\right)
\]
and
\[
\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(x_i^*\varepsilon_i^*) = \frac{\sigma^2}{n}\sum_{i=1}^n E\left(x_i^* x_i^{*\prime}\right) = \sigma^2 E\left(\frac{X'\Omega^{-1}X}{n}\right).
\]
Therefore,
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(x_i^*\varepsilon_i^*) = \sigma^2\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right),
\]
which is a finite matrix by assumption. By the Lindeberg-Feller central limit theorem,
\[
\frac{\sqrt n}{n}\sum_{i=1}^n x_i^*\varepsilon_i^* \equiv \frac{X^{*\prime}\varepsilon^*}{\sqrt n} \equiv \frac{X'\Omega^{-1}\varepsilon}{\sqrt n} \to_d N\left(0,\ \sigma^2 Q\right)
\]
and since
\[
\sqrt n\,(b_{GLS}-\beta) \equiv \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} \to_d Q^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n},
\]
then
\[
\sqrt n\,(b_{GLS}-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right).
\]
The asymptotic covariance matrix of $b_{GLS}$ is therefore
\[
\mathrm{Avar}(b_{GLS}) = \frac{\sigma^2}{n}Q^{-1},
\]
and it is estimated by $\widehat{\mathrm{Avar}}(b_{GLS}) = s^2_{GLS}\left(X'\Omega^{-1}X\right)^{-1}$, where
\[
s^2_{GLS} = \frac{(y^* - X^*b_{GLS})'(y^* - X^*b_{GLS})}{n-k} = \frac{(y - Xb_{GLS})'\Omega^{-1}(y - Xb_{GLS})}{n-k}.
\]
Exercise 47. (This may be skipped) Under GLS.1, SE and RS, prove that $\operatorname{plim}\left(s^2_{GLS}\right) = \sigma^2$.
7.3.3. Feasible GLS. In general situations we may know the form of $\Omega$ but not the values taken on by its elements. Therefore, to make GLS operational we need an estimate of $\Omega$, say $\hat\Omega$. Replacing $\Omega$ by $\hat\Omega$ in (7.3.4) delivers the feasible GLS, henceforth FGLS:
\[
b_{FGLS} = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}y.
\]
Since GLS is consistent, knowing that $b_{GLS}$ and $b_{FGLS}$ are asymptotically equivalent, i.e. $\operatorname{plim}(b_{FGLS} - b_{GLS}) = 0$, is enough to ensure that $b_{FGLS}$ is consistent, but not that
\[
\sqrt n\,(b_{FGLS}-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right).
\]
For this we need the stronger condition that $\sqrt n\,(b_{FGLS}-\beta)$ and $\sqrt n\,(b_{GLS}-\beta)$ be asymptotically equivalent, or
\[
\sqrt n\,(b_{FGLS} - b_{GLS}) \to_p 0. \tag{7.3.6}
\]
Two sufficient conditions for (7.3.6) are the following:
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0 \tag{7.3.7}
\]
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0. \tag{7.3.8}
\]
Exercise 48. Use condition GLS.1, that
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = Q,
\]
the similar condition that
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = q_0,
\]
where $q_0$ is a finite vector, and the Slutsky Theorem (if $g$ is a continuous function, then $\operatorname{plim}[g(z)] = g[\operatorname{plim}(z)]$, p. 1113) to verify that
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} - \frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right) = 0
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0
\]
are sufficient for (7.3.6).
Solution: Given
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) = Q
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = 0,
\]
then
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n}\right) + \operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = Q,
\]
and so, applying the Slutsky Theorem twice, we get
\[
\operatorname{plim}\left(\frac{X'\Omega^{-1}X}{n} + \frac{X'\hat\Omega^{-1}X}{n} - \frac{X'\Omega^{-1}X}{n}\right) = \operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n}\right) = Q
\]
and
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1} = Q^{-1}. \tag{7.3.9}
\]
By the same token,
\[
\operatorname{plim}\left(\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n}\right) = q_0. \tag{7.3.10}
\]
But now, since
\[
\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}\varepsilon,
\qquad
\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}\varepsilon,
\]
$b_{FGLS} - \beta = \left(X'\hat\Omega^{-1}X\right)^{-1}X'\hat\Omega^{-1}\varepsilon$ and $b_{GLS} - \beta = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}\varepsilon$, then
\[
\left(\frac{X'\hat\Omega^{-1}X}{n}\right)^{-1}\frac{X'\hat\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\,(b_{FGLS} - \beta),
\qquad
\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n} = \sqrt n\,(b_{GLS} - \beta).
\]
The last two equalities, along with the maintained conditions (7.3.7) and (7.3.8), the asymptotic results (7.3.9) and (7.3.10) and the Slutsky Theorem, prove that both $\sqrt n\,(b_{GLS} - \beta)$ and $\sqrt n\,(b_{FGLS} - \beta)$ converge in probability to the same limit, $Q^{-1}q_0$.

Conditions (7.3.7) and (7.3.8) must be verified on a case-by-case basis. Importantly, they may hold even in cases in which $\hat\Omega$ is not consistent, as shown in the context of FGLS panel data estimators by Prucha (1984).
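To fix ideas, here is a hedged two-step FGLS sketch for the simple case of group-wise heteroskedasticity, $\mathrm{Var}(\varepsilon_i) = \sigma^2_g$ for observations in group $g$; the variable names (y, x1, x2, group) are hypothetical and the group variances are estimated from first-step OLS residuals:

regress y x1 x2
predict e0, residuals
gen e0sq = e0^2
bysort group: egen sig2hat = mean(e0sq)
regress y x1 x2 [aweight = 1/sig2hat]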
7.4. Large sample tests
7.4.1. Introduction. This section covers large sample tests in more detail than Greene
(2008). For the exam you can skip the derivations of the asymptotic results.
Assume the following results hold:
(1) $\sqrt n\,(b-\beta) \to_d N\left(0,\ \sigma^2 Q^{-1}\right)$;
(2) $\operatorname{plim}\left(\frac{X'X}{n}\right) = Q$;
(3) $\operatorname{plim}\left(s^2\right) = \sigma^2$;
and consider the following lemma, referred to as the product rule. For more on this see White (2001), p. 67 (notice that the product rule is not mentioned in Greene (2008), although it is implicitly used for proving the asymptotic distributions of the tests).

Lemma 49. (The product rule) Let $A_n$ be a sequence of random $(l\times k)$ matrices and $b_n$ a sequence of random $(k\times 1)$ vectors such that $\operatorname{plim}(A_n) = 0$ and $b_n \to_d z$. Then, $\operatorname{plim}(A_n b_n) = 0$.
7.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis $H_o: \beta_k = \beta^o_k$. We begin by noting that under $H_o$
\[
\frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{\sigma^2 Q^{-1}_{kk}}} \to_d N(0,1) \tag{7.4.1}
\]
by result 1 and Theorem D.16(2) in Greene (2008) (p. 1050).

Then, the t-ratio test for $H_o$ is
\[
t = \frac{b_k - \beta^o_k}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}},
\]
where $\left(X'X\right)^{-1}_{kk} \equiv \left(x_k'M_{[X_{(k)}]}x_k\right)^{-1}$ and $X = \left(X_{(k)}\ x_k\right)$ (see Section 4.8). Since $t$ can be reformulated as
\[
t = \frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}},
\]
then
\[
\operatorname{plim}\left[\frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{\sqrt n\,(b_k - \beta^o_k)}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right]
= \operatorname{plim}\left\{\left(\frac{1}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}} - \frac{1}{\sqrt{\sigma^2 Q^{-1}_{kk}}}\right)\sqrt n\,(b_k - \beta^o_k)\right\} = 0, \tag{7.4.2}
\]
where the second equality follows from the product rule, given that, by results 2-3 and the Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim converges in probability to zero and, by result 1, the second factor converges in distribution to a normal random scalar. Hence, the two sequences in the plim of equation (7.4.2) are asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given (7.4.1), this proves that
\[
\frac{b_k - \beta^o_k}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}} \to_d N(0,1).
\]
Consider, now, the general case of a null hypothesis $H_o: r'\beta = q$, where $r$ is a non-zero $(k\times 1)$ vector of non-random constants and $q$ is a non-random scalar. Using the same approach as above it is possible to prove that
\[
\frac{r'(b-\beta)}{\sqrt{s^2\, r'\left(X'X\right)^{-1}r}} \to_d N(0,1). \tag{7.4.3}
\]
Exercise 50. (skip) Prove (7.4.3). Hint: by the Slutsky Theorem, $\operatorname{plim}\left[r'\left(\frac{X'X}{n}\right)^{-1}r\right] = r'Q^{-1}r$.
7.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis $H_o: R\beta - q = 0$, where $R$ is a non-random $(J\times k)$ matrix of full-row rank and $q$ is a $(J\times 1)$ column vector. Under $H_o$, $R\beta = q$ and so the F test can be written as
\[
F = \frac{(b-\beta)'R'\left[s^2 R\left(X'X\right)^{-1}R'\right]^{-1}R(b-\beta)}{J}.
\]
The foregoing equation can be rearranged as
\[
JF = \sqrt n\,(b-\beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta).
\]
Now let $A \equiv \sigma^2 RQ^{-1}R'$. Since $A$ is p.d. ($R$ is f.r.r.), there exists a p.d. matrix $A^{1/2}$ such that $A^{1/2}A^{1/2} = A$ and $A^{-1/2} = \left(A^{1/2}\right)^{-1}$. Then, by result 1 and the Slutsky Theorem,
\[
A^{-1/2}R\sqrt n\,(b-\beta) \to_d N(0,\ I_J). \tag{7.4.4}
\]
Similarly, let $\hat A \equiv s^2 R\left(X'X/n\right)^{-1}R'$. Since $\hat A$ is p.d., there exists a p.d. matrix $\hat A^{1/2}$ such that $\hat A^{1/2}\hat A^{1/2} = \hat A$ and $\hat A^{-1/2} = \left(\hat A^{1/2}\right)^{-1}$. Then
\[
\operatorname{plim}\left[\hat A^{-1/2}R\sqrt n\,(b-\beta) - A^{-1/2}R\sqrt n\,(b-\beta)\right]
= \operatorname{plim}\left[\left(\hat A^{-1/2} - A^{-1/2}\right)R\sqrt n\,(b-\beta)\right] = 0, \tag{7.4.5}
\]
where the second equality follows from the product rule given that
\[
\operatorname{plim}\left(\hat A^{-1/2}\right) = A^{-1/2},
\]
by results 2 and 3 and the Slutsky Theorem, and
\[
R\sqrt n\,(b-\beta) \to_d N(0,\ A),
\]
by result 1 and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two sequences in the left-hand-side plim of equation (7.4.5) have the same limiting distribution and, given (7.4.4), this proves that
\[
\hat A^{-1/2}R\sqrt n\,(b-\beta) \to_d N(0,\ I_J).
\]
Let $w \equiv \hat A^{-1/2}R\sqrt n\,(b-\beta)$; then by Theorem D.16(2),
\[
w'w \to_d \chi^2(J). \tag{7.4.6}
\]
But since
\[
\hat A^{-1/2}\hat A^{-1/2} = \hat A^{-1} = \left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1},
\]
then
\[
w'w = \sqrt n\,(b-\beta)'R'\hat A^{-1/2}\hat A^{-1/2}R\sqrt n\,(b-\beta)
= \sqrt n\,(b-\beta)'R'\left[s^2 R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta) = JF,
\]
and so by (7.4.6)
\[
JF \to_d \chi^2(J).
\]
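In practice the Wald statistic is rarely computed by hand; after regress, the test command does it. A hedged illustration with hypothetical variables, testing $J = 2$ restrictions jointly:

regress y x1 x2 x3
test (x1 = 0) (x2 = x3)
display "JF = " r(df)*r(F) "   chi2(2) 5% critical value = " invchi2tail(2, 0.05)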
CHAPTER 8
Fixed and Random Effects Panel Data Models
8.1. Introduction
This chapter covers the two most important panel data models: the fixed effect and the
random effect models.
For simplicity we start directly from the statistical models. The sampling mechanism will
be introduced when proving asymptotic normality.
Results in this chapter are demonstrated through the do-file paneldata.do using the data-set airlines.dta, a panel data set that I have extracted from costfn.dta (Baltagi et al. 1998).
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)
Consider the following panel data regression model expressed at the observation level, that is, for individual $i = 1,\dots,N$ and time $t = 1,\dots,T$:
\[
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it} \tag{8.2.1}
\]
where $x_{it}' = (x^1_{it},\dots,x^k_{it})$,
\[
\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}
\]
and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$.
Define $d^j_{it}$ as the value taken on by the dummy variable indicating individual $j = 1,\dots,N$ at observation $(i,t)$, that is
\[
d^j_{it} = \begin{cases} 1 & \text{if } i = j, \text{ any } t = 1,\dots,T\\ 0 & \text{if } i \neq j, \text{ any } t = 1,\dots,T.\end{cases}
\]
Then, model (8.2.1) can be equivalently written as
\[
y_{it} = x_{it}'\beta + d^1_{it}\alpha_1 + \dots + d^i_{it}\alpha_i + \dots + d^N_{it}\alpha_N + \varepsilon_{it}, \tag{8.2.2}
\]
$i = 1,\dots,N$ and $t = 1,\dots,T$.

In a more compact notation, at the individual level $i = 1,\dots,N$, model (8.2.2) is written as
\[
y_i = X_i\beta + d^1_i\alpha_1 + \dots + d^i_i\alpha_i + \dots + d^N_i\alpha_N + \varepsilon_i,
\]
where
\[
\underset{(T\times 1)}{y_i} = \begin{pmatrix} y_{i1}\\ \vdots\\ y_{it}\\ \vdots\\ y_{iT}\end{pmatrix},\qquad
\underset{(T\times k)}{X_i} = \begin{pmatrix} x_{i1}'\\ \vdots\\ x_{it}'\\ \vdots\\ x_{iT}'\end{pmatrix},\qquad
\underset{(T\times 1)}{\varepsilon_i} = \begin{pmatrix} \varepsilon_{i1}\\ \vdots\\ \varepsilon_{it}\\ \vdots\\ \varepsilon_{iT}\end{pmatrix}
\]
and
\[
d^j_i = \begin{cases} 1_T & \text{if } i = j\\ 0_T & \text{if } i \neq j,\end{cases}
\]
$1_T$ indicates the $(T\times 1)$ vector of all unity elements and $0_T$ the $(T\times 1)$ vector of all zero elements.
Stacking data by individuals, an even more compact representation of the regression model (8.2.2), at the level of the whole data-set, is
\[
y = X\beta + D\alpha + \varepsilon, \tag{8.2.3}
\]
where
\[
\underset{(NT\times 1)}{y} = \begin{pmatrix} y_1\\ \vdots\\ y_i\\ \vdots\\ y_N\end{pmatrix},\qquad
\underset{(NT\times k)}{X} = \begin{pmatrix} X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix},\qquad
\underset{(NT\times 1)}{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_i\\ \vdots\\ \varepsilon_N\end{pmatrix},\qquad
\underset{(N\times 1)}{\alpha} = \begin{pmatrix} \alpha_1\\ \vdots\\ \alpha_i\\ \vdots\\ \alpha_N\end{pmatrix}
\]
and $D$ is the $(NT\times N)$ matrix of dummy variables
\[
D = \begin{pmatrix}
1_T & 0_T & \cdots & 0_T\\
0_T & 1_T & \cdots & 0_T\\
\vdots & \vdots & \ddots & \vdots\\
0_T & 0_T & \cdots & 1_T
\end{pmatrix},
\]
or equivalently $D = (d^1\ d^2 \dots\ d^N)$. Under the following assumptions model (8.2.3) is a classical regression model that includes individual dummies:
FE.1: The extended regressor matrix (X D) has f.c.r. Therefore, not only is X of f.c.r.,
but also none of its columns can be expressed as a linear combination of the dummy
variables, which boils down to saying that no column of X can be time-constant,
which in turn implies that X does not include the unity vector (indeed, there is a
constant term in model (8.2.3), but one that jumps across individuals).
FE.2: $E(\varepsilon|X) = 0$. Hence, the variables in $X$ are strictly exogenous with respect to $\varepsilon$, but the statistical relationship with $\alpha$ is left completely unrestricted. Model (8.2.3), therefore, automatically accommodates any form of omitted-variable bias due to the omission of time-constant regressors. Notice that $D$ is taken as a non-random matrix, therefore conditioning on $(X\ D)$ or simply on $X$ is exactly the same.
FE.3: $\mathrm{Var}(\varepsilon|X) = \sigma^2_\varepsilon I_{NT}$. This is standard. It can be relaxed in more advanced treatments of the topic, as in Arellano (2003) for example, but see also Section 8.7 (and Chapter 9, which can however be skipped for the exam).
Exercise 51. Prove that the following model with the constant term is an equivalent reparametrization of Model (8.2.3):
\[
y = 1_{NT}\alpha_0 + X\beta + D_{-1}\tilde\alpha_{-1} + \varepsilon, \tag{8.2.4}
\]
where $D_{-1} = (d^2 \dots d^N)$, $\tilde\alpha_{-1} \equiv \alpha_{-1} - 1_{N-1}\alpha_1$, $\alpha_{-1} = (\alpha_2 \dots \alpha_N)'$, $1_s$ denotes the $s\times 1$ vector of all unity elements and $\alpha_0 \equiv \alpha_1$.

Solution. Partition $D = (d^1\ D_{-1})$ and
\[
\alpha = \begin{pmatrix}\alpha_1\\ \alpha_{-1}\end{pmatrix}.
\]
Then, rewrite model (8.2.3) equivalently as
\[
y = d^1\alpha_1 + X\beta + D_{-1}\alpha_{-1} + \varepsilon. \tag{8.2.5}
\]
Since $D1_N = 1_{NT}$, then $(d^1\ D_{-1})1_N = 1_{NT}$, or equivalently
\[
d^1 + D_{-1}1_{N-1} = 1_{NT}.
\]
Therefore, we can reparametrize model (8.2.5) by adding $1_{NT}\alpha_1$ to and subtracting $(d^1 + D_{-1}1_{N-1})\alpha_1$ from the right-hand side of (8.2.5) to obtain
\[
y = 1_{NT}\alpha_1 + d^1\alpha_1 + X\beta + D_{-1}\alpha_{-1} - (d^1 + D_{-1}1_{N-1})\alpha_1 + \varepsilon
= 1_{NT}\alpha_1 + X\beta + D_{-1}\alpha_{-1} - D_{-1}1_{N-1}\alpha_1 + \varepsilon
= 1_{NT}\alpha_0 + X\beta + D_{-1}\tilde\alpha_{-1} + \varepsilon,
\]
where $\alpha_0 \equiv \alpha_1$ and $\tilde\alpha_{-1} \equiv \alpha_{-1} - 1_{N-1}\alpha_1$.
Remark 52. Exercise 51 demonstrates that after the reparametrization the interpretation of the $\beta$ coefficients is unchanged, the constant term is $\alpha_1$ and the coefficients on the remaining individual dummies are no longer the individual effects of the remaining individuals, $\alpha_i$, $i = 2,\dots,N$, but rather the contrasts of $\alpha_i$ with respect to $\alpha_1$, $i = 2,\dots,N$. Of course, the reference individual need not be the first one in the data-set and can be freely chosen among the $N$ individuals by the researcher at her/his own convenience. In Stata this is implemented by using regress followed by the dependent variable, the $X$ regressors and $N-1$ dummy variables (see paneldata.do).
Remark 53. The interpretation of the constant in Exercise 51 is different from that in
the Stata transformation (see 10/04/12 Exercises) of Model (8.2.3). In the former case the
constant term is the effect of the individual whose dummy is removed from the regression, in
the latter it is the average of the N individual effects.
The LSDV estimator is just the OLS estimator applied to model (8.2.3) and, given FE.1-3, it is the BLUE. The separate formulas of LSDV for $\beta$ and $\alpha$ are obtained by applying the FWL Theorem to (8.2.3). So,
\[
b_{LSDV} = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y
\]
is the LSDV estimator for $\beta$ and
\[
a_{LSDV} = \left(D'D\right)^{-1}D'\left(y - Xb_{LSDV}\right)
\]
is the LSDV estimator for $\alpha$. As already mentioned, both are BLUEs, but while $b_{LSDV}$ converges in probability to $\beta$ when $N\to\infty$ or $T\to\infty$ or both, $a_{LSDV}$ converges in probability to $\alpha$ only when $T\to\infty$. This discrepant large-sample behavior of $b_{LSDV}$ and $a_{LSDV}$ is due to the fact that the dimension of $\alpha$ increases as $N$ increases, whereas that of $\beta$ is kept fixed at $k$.
Exercise 54. Verify that
\[
\left(D'D\right)^{-1} = \begin{pmatrix}
1/T & 0 & \cdots & 0\\
0 & 1/T & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & 0 & \cdots & 1/T
\end{pmatrix}.
\]
Given Exercise 54, $(D'D)^{-1}D' = \frac{1}{T}D'$ and so for a generic $(NT\times 1)$ vector $z$
\[
\left(D'D\right)^{-1}D'z = \begin{pmatrix}
\frac{1}{T}\sum_{t=1}^T z_{1t}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^T z_{it}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^T z_{Nt}
\end{pmatrix} \equiv \begin{pmatrix} \bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\end{pmatrix} \equiv \bar z.
\]
In words, premultiplying any $(NT\times 1)$ vector $z$ by $(D'D)^{-1}D'$ transforms it into an $(N\times 1)$ vector of means, $\bar z$, where each mean is taken over the group of observations peculiar to the same individual and for this reason is called a group mean. Therefore,
\[
a_{LSDV,i} = \bar y_i - \bar x_i' b_{LSDV},
\]
where $\bar x_i'$ is the $(1\times k)$ vector of group means for individual $i$, $\bar x_i' = \left(\bar x^1_i \dots \bar x^k_i\right)$. It is also clear that for any $(NT\times 1)$ vector $z$
\[
P_{[D]}z = D\left(D'D\right)^{-1}D'z = \begin{pmatrix}
\bar z_1\\ \vdots\\ \bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\\ \vdots\\ \bar z_N
\end{pmatrix}.
\]
In words, premultiplying any $(NT\times 1)$ vector $z$ by $P_{[D]}$ transforms it into a sample-conformable $(NT\times 1)$ vector of group means: each group mean is repeated $T$ times. It follows that $M_{[D]}z = z - P_{[D]}z$ is the $(NT\times 1)$ vector of group-mean deviations. Therefore, one can obtain $b_{LSDV}$ by applying OLS to the model transformed in group-mean deviations, that is, regressing $M_{[D]}y$ on $M_{[D]}X$.

Exercise 55. Verify (in a couple of seconds...) that $P_{[D]} = \frac{1}{T}DD'$.

The conditional variance-covariance matrix of $b_{LSDV}$ is $\mathrm{Var}(b_{LSDV}|X) = \sigma^2_\varepsilon\left(X'M_{[D]}X\right)^{-1}$. It is estimated by replacing $\sigma^2_\varepsilon$ with the Anova estimator $s^2_{LSDV}$, based on the LSDV residuals $e_{LSDV} = M_{[D]}y - M_{[D]}Xb_{LSDV}$:
\[
s^2_{LSDV} = \frac{e_{LSDV}'e_{LSDV}}{NT - N - k}. \tag{8.2.6}
\]
Exercise 56. Prove that $E\left(s^2_{LSDV}\right) = \sigma^2_\varepsilon$. This is a long one, but when done you can tell yourself "BRAVO!" I just give you a few hints. First, on noting that $y$ is determined by the right-hand side of (8.2.3), prove that $e = M_{[M_{[D]}X]}M_{[D]}\varepsilon$; then elaborate the conditional mean of $\varepsilon'M_{[D]}M_{[M_{[D]}X]}M_{[D]}\varepsilon$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.

It is not hard to verify (do it) that $b_{LSDV}$ can be obtained from the OLS regression of model (8.2.3) transformed in group-mean deviations (this transformation is referred to in the panel-data literature as the within transformation):
\[
M_{[D]}y = M_{[D]}X\beta + M_{[D]}\varepsilon. \tag{8.2.7}
\]
The intuition is simple: since the group mean of any time-constant element, such as $\alpha_i$, coincides with the element itself, all time-constant elements in model (8.2.3) are wiped out; this also explains why $X$ cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (8.2.3), making it look almost like a classical LRM. In particular, it can be proved easily that LRM.1-LRM.3 hold. Notice, however, that the errors in the transformed model, $M_{[D]}\varepsilon$, have a non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular; can you derive it?). Specifically, the vector $M_{[D]}\varepsilon$ presents within-group serial correlation, since for each individual group there are only $T-1$ linearly independent demeaned errors. As a consequence, LRM.4 does not apply to model (8.2.7). All the same, $b_{LSDV}$ is BLUE. This is true because the condition of Theorem 38 in Section 5.4 is met (if you have answered the previous question on the covariance matrix of $M_{[D]}\varepsilon$, you should be able to verify this claim as well).

One should not conclude from the foregoing discussion that OLS on the within-transformed model (8.2.7) is a safe strategy. As in the Oaxaca pooled model of Section 5.2, the fact that the error covariance matrix is not spherical, presenting in this specific case within-group serial correlation, has bad consequences as far as standard error estimates are concerned. Indeed, should we leave the econometric software free to treat model (8.2.7) as a classical LRM, and so regress $M_{[D]}y$ on $M_{[D]}X$, it would compute the coefficient estimates just fine, but it would estimate $\mathrm{Var}(b_{LSDV}|X)$ by $s^2\left(X'M_{[D]}X\right)^{-1}$, with $s^2 = e_{LSDV}'e_{LSDV}/(NT-k) \neq s^2_{LSDV}$, which is biased since it uses the wrong degrees-of-freedom correction. The econometric software cannot be aware that for each individual in the sample there are only $T-1$ linearly independent demeaned errors and so, rather than dividing the residual sum of squares by $N(T-1)-k$, it divides it by $NT-k$. The upshot is that standard errors estimated in this way need rectifying by multiplying each of them by the correction factor $\sqrt{(NT-k)/(NT-N-k)}$.
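A minimal Stata sketch of this manual within regression (hypothetical variable names y, x1 and individual identifier id; xtreg, fe applies the correct degrees of freedom automatically, so this is only for illustration):

bysort id: egen ybar  = mean(y)
bysort id: egen x1bar = mean(x1)
gen ydm  = y  - ybar
gen x1dm = x1 - x1bar
regress ydm x1dm, noconstant
* multiply each reported standard error by sqrt((NT-k)/(NT-N-k)) as explained above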
An interesting hypothesis to test is that of the absence of individual heterogeneity, $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_N$. Under the restriction implied by $H_0$, model (8.2.3) pools together all data with no attention to the individual clustering and can be written as
\[
y = X^*\beta^* + \varepsilon, \tag{8.2.8}
\]
where
\[
X^* = (1_{NT}\ X),\qquad \beta^* = \begin{pmatrix}\alpha_0\\ \beta\end{pmatrix}.
\]
Hence, under $H_0$, the pooled OLS (POLS) estimator
\[
b^*_{POLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y \tag{8.2.9}
\]
is the BLUE. Let $e_{POLS}$ indicate the restricted residual vector
\[
e_{POLS} = y - X^*b^*_{POLS}; \tag{8.2.10}
\]
then under normality and $H_0$
\[
F = \frac{\left(e_{POLS}'e_{POLS} - e_{LSDV}'e_{LSDV}\right)/(N-1)}{e_{LSDV}'e_{LSDV}/(NT-N-k)} \sim F(N-1,\ NT-N-k). \tag{8.2.11}
\]
If $F$ does not reject $H_0$, POLS is a legitimate estimation procedure, more efficient than LSDV. If $F$ rejects $H_0$, then POLS is biased and LSDV should be adopted.
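A hedged sketch of how (8.2.11) can be computed in Stata from the two residual sums of squares (hypothetical variable names; xtreg, fe reports the same statistic as the "F test that all u_i=0"):

quietly tabulate id, gen(id_)
scalar Ng = r(r)
regress y x1 x2
scalar rss_p = e(rss)
regress y x1 x2 id_*, noconstant
scalar rss_u = e(rss)
scalar F = ((rss_p - rss_u)/(Ng - 1)) / (rss_u/e(df_r))
display F, Ftail(Ng - 1, e(df_r), F)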
Exercise 57. On reparametrizing the LSDV model as in Exercise 51, the hypothesis of no individual heterogeneity becomes $H_0: \tilde\alpha_{-1} = 0$. Prove that the resulting F-test is numerically identical to $F$ in Equation (8.2.11).

Solution. Easy. Since models (8.2.3) and (8.2.4) are indeed the same model, the resulting F-test is numerically identical to the F-test in Equation (8.2.11). This is demonstrated empirically in the paneldata.do Stata do-file.
8.3. The Random Effect Model
The random effect model has the same algebraic structure as model (8.2.1). At the observation level, $i = 1,\dots,N$ and $t = 1,\dots,T$, we have
\[
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it} \tag{8.3.1}
\]
where $x_{it}' = (x^1_{it},\dots,x^k_{it})$,
\[
\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}
\]
and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$. The statistical properties of model (8.3.1) are different, though. Without loss of generality, write $\alpha_i$ as $\alpha_i = \alpha_0 + u_i$ and let
\[
\underset{(N\times 1)}{u} = \begin{pmatrix} u_1\\ \vdots\\ u_i\\ \vdots\\ u_N\end{pmatrix}.
\]
Model (8.3.1) can then be written compactly as
\[
y = X^*\beta^* + w, \tag{8.3.2}
\]
where
\[
X^* = (1_{NT}\ X),\qquad \beta^* = \begin{pmatrix}\alpha_0\\ \beta\end{pmatrix}
\]
and $w = \varepsilon + Du$.
The following is maintained.

RE.1: $X^*$ has f.c.r. This is equivalent to 1) $X$ of f.c.r. and 2) no linear combination of the columns of $X$ being equal to $1_{NT}$. Hence, as long as these two requirements are met, $X$ can contain time-constant variables.

RE.2: $E(\varepsilon|X^*) = 0$ and $E(u|X^*) = 0$. This maintains strict exogeneity of $X$ with respect to both components of $w$, and so with respect to $w$ itself. It is a stringent assumption, implying that the time-constant variables that are not included in the regression are unrelated to the included regressors $X$. Notice that since $1_{NT}$ is non-random, conditioning on $X^*$ is the same as conditioning on $X$.

RE.3: $\mathrm{Var}(\varepsilon|X^*) = \sigma^2_\varepsilon I_{NT}$, $\mathrm{Var}(u|X^*) = \sigma^2_u I_N$, $\mathrm{Cov}(\varepsilon,u|X^*) = E(\varepsilon u'|X^*) = 0_{(NT\times N)}$.

Let $\Sigma \equiv \mathrm{Var}(w|X^*)$. Then, given RE.3,
\[
\Sigma = \mathrm{Var}(\varepsilon|X^*) + \mathrm{Var}(Du|X^*) = \sigma^2_\varepsilon I_{NT} + \sigma^2_u DD'.
\]
This means that under RE.1-3 the covariance matrix of $w$, although homoskedastic along the main diagonal, is non-diagonal, and the POLS estimator in (8.2.9) is unbiased (verify this) but not BLUE (unless $\sigma^2_u = 0$). The BLUE estimator for $\beta^*$ is therefore the GLS random effect estimator
\[
b^*_{GLS-RE} = \left(X^{*\prime}\Sigma^{-1}X^*\right)^{-1}X^{*\prime}\Sigma^{-1}y.
\]
For the implementation of $b^*_{GLS-RE}$ we need to work out $\Sigma^{-1}$.

Exercise 58. Verify that $w$ is homoskedastic and in particular that $\mathrm{Var}(w_{it}|X^*) = \sigma^2_\varepsilon + \sigma^2_u$ for all $i = 1,\dots,N$ and $t = 1,\dots,T$.

Since (see Exercise 55)
\[
P_{[D]} = \frac{1}{T}DD',
\]
then
\[
\Sigma = \sigma^2_\varepsilon I_{NT} + T\sigma^2_u P_{[D]} = \sigma^2_\varepsilon I_{NT} - \sigma^2_\varepsilon P_{[D]} + \sigma^2_\varepsilon P_{[D]} + T\sigma^2_u P_{[D]} = \sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]},
\]
where $\sigma^2_1 = \sigma^2_\varepsilon + T\sigma^2_u$. Therefore,
\[
\Sigma^{-1} = \frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]} \tag{8.3.3}
\]
and
\[
b^*_{GLS-RE} = \left[X^{*\prime}\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right)y.
\]
Exercise 59. Verify that $\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}$ is indeed the inverse of $\Sigma$, that is,
\[
\left(\sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]}\right)\left(\frac{1}{\sigma^2_\varepsilon}M_{[D]} + \frac{1}{\sigma^2_1}P_{[D]}\right) = I_{NT}
\]
(easy, if you remember the properties of $M_{[D]}$ and $P_{[D]}$).

Exercise 60. 1) Verify that $b^*_{GLS-RE}$ can also be written as
\[
b^*_{GLS-RE} = \left[X^{*\prime}\left(M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}P_{[D]}\right)y. \tag{8.3.4}
\]
2) Verify that premultiplying all variables of model (8.3.2) by $M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ transforms it into a classical regression model, so that $b^*_{GLS-RE}$ can be obtained at once by applying OLS to the transformed model. 3) Verify that the operator $M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ can also be written as
\[
M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]} = I_{NT} - \left(1 - \frac{\sigma_\varepsilon}{\sigma_1}\right)P_{[D]}. \tag{8.3.5}
\]
The operator in (8.3.5), $M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}$, transforms any conformable variable it premultiplies into quasi-mean deviations, or partial deviations, in the sense that it removes only a portion of the group mean from the variable. For this reason, the coefficients on time-constant variables are identified in the RE model: time-constant variables, when premultiplied by $M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}$, are not wiped out, but rescaled by the factor $\sigma_\varepsilon/\sigma_1$. The RE model under the GLS transformation is therefore
\[
\left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]y = \left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]X^*\beta^* + \left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w \tag{8.3.6}
\]
and you may wish to verify that indeed
\[
\mathrm{Var}\left(\left[M_{[D]} + (\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w \mid X^*\right) = \sigma^2_\varepsilon I_{NT}.
\]
8.3.1. The Feasible GLS. The feasible version of $b^*_{GLS-RE}$, say $b^*_{FGLS-RE}$, the one that is actually implemented in econometric software, can be obtained through the method of Swamy and Arora (1972). The estimator for $\sigma^2_\varepsilon$ is simply $s^2_{LSDV}$ in (8.2.6), and that for $\sigma^2_1$ is obtained as follows.

Define the Between residual vector $e_B$ as
\[
e_B = P_{[D]}y - P_{[D]}X^*b^*_B, \tag{8.3.7}
\]
where $b^*_B = \left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}y$. In words, $e_B$ is the residual vector from the OLS regression of the group means of $y$ on the group means of $X^*$. The resulting estimator, $b^*_B$, is referred to in the panel data literature as the Between estimator.¹ Then, based on $e_B$, construct the Anova estimator for $\sigma^2_1$ as
\[
s^2_B = \frac{e_B'e_B}{N - k - 1}.
\]
¹Technical note: I maintain that no column of $X$ is either time-constant or already in group-mean deviations, so that both $b_{LSDV}$ and $b^*_B$ are uniquely defined (in fact, with such an assumption $X^{*\prime}P_{[D]}X^*$ and $X'M_{[D]}X$ are both non-singular). Indeed, this is made only for simplicity, since it is possible to prove that $s^2_B$ and $s^2_{LSDV}$ are uniquely defined even if $b_{LSDV}$ and $b^*_B$ are not. The proof requires that all inverse matrices in the residual formulas be replaced with generalized inverse matrices. But don't worry, I won't pursue it further.
Exercise 61. Prove that
\[
E\left(s^2_B\right) = \sigma^2_\varepsilon + T\sigma^2_u.
\]
Same hint as for Exercise 56: first, on noting that $y$ is determined by the right-hand side of (8.3.2), prove that $e_B = M_{[P_{[D]}X^*]}P_{[D]}w$; then elaborate the conditional mean of $w'P_{[D]}M_{[P_{[D]}X^*]}P_{[D]}w$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.
Solution. Replacing the formula of $b^*_B$ into the right-hand side of equation (8.3.7) gives
\[
e_B = \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]P_{[D]}y
= \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]\left(P_{[D]}X^*\beta^* + P_{[D]}w\right)
= \left[I - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right]P_{[D]}w
= M_{[P_{[D]}X^*]}P_{[D]}w.
\]
Therefore,
\[
e_B'e_B = w'P_{[D]}M_{[P_{[D]}X^*]}P_{[D]}w = w'M_{[P_{[D]}X^*]}P_{[D]}w,
\]
where the first equality follows from the idempotence of $M_{[P_{[D]}X^*]}$ and the second from
\[
P_{[D]}M_{[P_{[D]}X^*]} = M_{[P_{[D]}X^*]}P_{[D]}
\]
and the idempotence of $P_{[D]}$.
Upon noticing that $e_B'e_B$ is a scalar,
\[
e_B'e_B = \operatorname{tr}\left(w'M_{[P_{[D]}X^*]}P_{[D]}w\right) = \operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\right),
\]
and so
\[
E\left(e_B'e_B\mid X^*\right) = E\left[\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\right)\mid X^*\right]
= \operatorname{tr}\left[E\left(M_{[P_{[D]}X^*]}P_{[D]}ww'\mid X^*\right)\right]
= \operatorname{tr}\left[M_{[P_{[D]}X^*]}P_{[D]}E\left(ww'\mid X^*\right)\right]
= \operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\Sigma\right).
\]
Expressing $\Sigma$ in spectral decomposition,
\[
\Sigma = \sigma^2_\varepsilon M_{[D]} + \sigma^2_1 P_{[D]},
\]
yields $P_{[D]}\Sigma = \sigma^2_1 P_{[D]}$, given that $P_{[D]}$ is idempotent and $P_{[D]}M_{[D]} = 0_{(NT\times NT)}$. Hence,
\[
E\left(e_B'e_B\mid X^*\right) = \sigma^2_1\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right).
\]
Then, it remains to prove that $\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right) = N - k - 1$. Since
\[
M_{[P_{[D]}X^*]}P_{[D]} = P_{[D]} - P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]},
\]
\[
\operatorname{tr}\left(M_{[P_{[D]}X^*]}P_{[D]}\right) = \operatorname{tr}\left(P_{[D]}\right) - \operatorname{tr}\left[P_{[D]}X^*\left(X^{*\prime}P_{[D]}X^*\right)^{-1}X^{*\prime}P_{[D]}\right] = \operatorname{tr}(I_N) - \operatorname{tr}(I_{k+1}) = N - k - 1.
\]
In conclusion, the Feasible GLS, $b^*_{FGLS-RE}$, is
\[
b^*_{FGLS-RE} = \left[X^{*\prime}\left(M_{[D]} + \frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)X^*\right]^{-1}X^{*\prime}\left(M_{[D]} + \frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)y.
\]
Exercise 62. Prove that
\[
E\left(\frac{e_{LSDV}'e_{LSDV}}{NT-N-k}\ \middle|\ X\right) = \sigma^2_\varepsilon
\]
(hint: follow the same steps as above, noticing that $M_{[D]}w = M_{[D]}\varepsilon$).
Exercise 63. Derive the formula for the subvector of $b^*_{GLS-RE}$, say $b_{GLS-RE}$, estimating the $\beta$ vector.

Solution. Simply apply the FWL Theorem to the GLS-transformed RE model in (8.3.6), noticing that by the well-known properties of orthogonal projectors $P_{[D]}P_{[1_{NT}]} = P_{[1_{NT}]}$ (remember that $1_{NT} = D1_N$ and so $1_{NT} \in R(D)$), so that
\[
\left(M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right)M_{[1_{NT}]}\left(M_{[D]} + \frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right) = M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)
\]
and eventually
\[
b_{GLS-RE} = \left\{X'\left[M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)\right]X\right\}^{-1}X'\left[M_{[D]} + \frac{\sigma^2_\varepsilon}{\sigma^2_1}\left(P_{[D]} - P_{[1_{NT}]}\right)\right]y.
\]
8.4. Stata implementation of standard panel data estimators
Both fixed effects and random effects estimators are implemented through the Stata com-
mand xtreg, with the usual Stata syntax for regression commands: the command is followed
by the name of the dependent variable and then the list of regressors. The noconstant option
is not admitted in this case.
As a preliminary step, however, a panel data declaration is needed to make Stata aware
of which variables in our data identify time and individuals. Suppose that in our data the
individual variable is named id and the time variable time, then the panel data declaration is
carried out by the instruction
xtset id time
The random effect estimator is the default of xtreg, while the fixed effects (LSDV) esti-
mator requires the option fe.
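For instance, with hypothetical variables y, x1, x2 and panel identifiers id and time:

xtset id time
xtreg y x1 x2, fe    // fixed effects (LSDV)
xtreg y x1 x2, re    // random effects (Swamy-Arora FGLS)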
Sometimes you may find it convenient to implement the FE and RE estimators by hand, using regress rather than xtreg. The greater computational effort may pay for the simple reason that regress, being the most popular estimation command in Stata, is updated more frequently to accommodate the most recent developments in statistics and econometrics, and so typically has more options than any other estimation command in Stata. To implement $b_{LSDV}$ and $a_{LSDV}$ at once you may just apply regress to the LSDV model (8.2.3). This requires generating a full set of individual dummies from the individual identifier id in your panel. This is done through the tabulate command with an option, as follows

tabulate id, gen(id_)

where id_ is a name of choice. If N equals, say, 100, tabulate will add the full set of 100 individual dummies to your data, with names id_1, id_2, ..., id_100, and you can just treat them as regressors in a regress instruction to get $b_{LSDV}$ as the coefficient estimates on the X variables and $a_{LSDV}$ as the coefficient estimates on the id_1-id_100 variables. Degrees of freedom are correctly calculated as $NT - N - k$ and so no correction of standard errors is needed. Notice that if you include all 100 dummies, then the constant term should be removed by the noconstant option. Alternatively, you can leave it there and include $N-1$ dummies. While the $b_{LSDV}$ estimates remain unchanged, the coefficient estimates on the included dummies do not. The latter must now be thought of as contrasts with respect to the constant estimate, which turns out to equal the individual effect estimate peculiar to the individual excluded from the regression, who is therefore treated as the base individual. Nothing is lost by choosing either identification strategy.
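A minimal sketch of the two equivalent identification strategies just described (variable names, and N = 100, are hypothetical):

tabulate id, gen(id_)
regress y x1 x2 id_*, noconstant     // all N dummies, no constant: coefficients on id_* are a_LSDV
regress y x1 x2 id_2-id_100          // constant plus N-1 dummies: dummy coefficients are contrasts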
When N is large the foregoing regress strategy is not practical. The $b_{LSDV}$ estimator can then be implemented manually by applying the within transformation, carrying out OLS on the transformed model and then correcting the standard errors appropriately. Implementation of $b^*_{FGLS-RE}$ by hand is trickier and proceeds along the following steps: 1) get the two variance-component estimates from the within and between regressions; 2) transform the variables (including the constant) in partial deviations; and 3) apply OLS to the transformed variables. Details can be found in a Stata do-file available on the learning space.
I recommend always using the official xtreg command to implement the standard panel data estimators in empirical applications, unless it is strictly necessary to do otherwise (for example, if I explicitly ask you to!).
8.5. Testing fixed effects against random effects models
As Hausman (1978) and Mundlak (1978) independently found (in two papers that appeared in the same Econometrica issue!), the RE model is a special case of the FE model. In fact, while in the former model assumption RE.2 models the relationship between the random individual components, $u$, and $X$ ($E(u|X) = 0$), the latter leaves it completely unrestricted. As a consequence, the RE model is nested within the FE model, so that a test discriminating between them can be easily implemented with $E(u|X) = 0$ as the null hypothesis.
I present here two popular tests that, moving from the foregoing consideration, can provide
some guidance in the choice between RE and FE models.
8.5.1. The Hausman test. Under $H_o: E(u|X) = 0$, both the LSDV and FGLS-RE estimators are consistent for $N\to\infty$, but the LSDV estimator is inefficient: redundant individual effects are included in the regression when they could instead have been treated as random disturbances, saving degrees of freedom. On the other hand, if $H_o$ is not true the LSDV estimator remains consistent, but FGLS does not, suffering an omitted-variable bias. The basic idea of the Hausman test (Hausman, 1978), therefore, is that under $H_o$ the statistical difference between the two estimators should not be significantly different from zero in large samples.
Hausman proves that, under RE.1-RE.3, such difference can be measured by the statistic
\[
H = (b_{LSDV} - b_{FGLS-RE})'\,\widehat{\mathrm{Avar}}(b_{LSDV} - b_{FGLS-RE})^{-1}(b_{LSDV} - b_{FGLS-RE})
\]
and also that
\[
H \to_d \chi^2(k).
\]
Hausman also provides a useful computational result. He shows that since $b_{FGLS-RE}$ is asymptotically efficient and $b_{LSDV}$ is inefficient under the null,
\[
\mathrm{Acov}(b_{LSDV} - b_{FGLS-RE},\ b_{FGLS-RE}) = 0,
\]
so
\[
\mathrm{Acov}(b_{LSDV} - b_{FGLS-RE},\ b_{FGLS-RE}) = \mathrm{Acov}(b_{LSDV},\ b_{FGLS-RE}) - \mathrm{Avar}(b_{FGLS-RE}) = 0
\]
and
\[
\mathrm{Avar}(b_{LSDV} - b_{FGLS-RE}) = \mathrm{Avar}(b_{LSDV}) - \mathrm{Avar}(b_{FGLS-RE}).
\]
Hence,
\[
H = (b_{LSDV} - b_{FGLS-RE})'\left[\widehat{\mathrm{Avar}}(b_{LSDV}) - \widehat{\mathrm{Avar}}(b_{FGLS-RE})\right]^{-1}(b_{LSDV} - b_{FGLS-RE}).
\]
Wooldridge (2010) (pp. 328-334) points out two difficulties with the Hausman test.

First, $\mathrm{Avar}(b_{LSDV}) - \mathrm{Avar}(b_{FGLS-RE})$ is singular if $X$ includes aggregate variables, such as time dummies. Therefore, along with the coefficients on time-constant variables, also those on aggregate variables must be excluded from the Hausman statistic.
Second, and more importantly, if RE.3 fails, then, on the one hand, the asymptotic distribution of $H$ is not standard even if RE.2 holds, so that $H$ would be of little guidance in detecting violations of RE.2, with an actual size that may be significantly different from the nominal size. On the other hand, $H$ is designed to detect violations of RE.2 and not of RE.3. In fact, if RE.2 holds both LSDV and FGLS-RE are consistent, regardless of RE.3, and $H$ converges in distribution rather than diverging, which means that the probability of rejecting RE.3 when it is false does not tend to unity as $N\to\infty$, making $H$ inconsistent. The solution is therefore to consider $H$ as a test of RE.2 only, but in a version that is robust to violations of RE.3. The approach I describe next is well suited to solve both difficulties at once.
8.5.2. The Mundlak test. Mundlak (1978) asks the following question. Is it possible to find an estimator that is more efficient than LSDV within a framework that allows correlation between the individual effects, taken as random variables, and $X$? To provide an answer, he starts from model (8.2.3) and supposes that the individual effects are linearly related to the regressors according to the following equation
\[
\alpha = 1_N\pi_0 + \left(D'D\right)^{-1}D'X\pi + u,
\]
with $E(\alpha|X) = E\left[\alpha\mid\left(D'D\right)^{-1}D'X\right]$, and so $E(u|X) = 0$. Pre-multiplying both sides of the foregoing equation by $D$ and then replacing the right-hand side of the resulting equation into (8.2.3) yields
\[
y = 1_{NT}\pi_0 + X\beta + P_{[D]}X\pi + Du + \varepsilon, \tag{8.5.1}
\]
which is evidently a RE model extended to include the $P_{[D]}X$ regressors. Model (8.5.1) springs from a restriction in (8.2.3) and hence seems promising for more efficient estimates. But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (8.5.1) returns the LSDV estimator: $b_{LSDV}$ for the $\beta$ coefficients, $b_B - b_{LSDV}$ for the $\pi$ coefficients and $b_{0B}$ for the constant term $\pi_0$, where $b_{0B}$ and $b_B$ are the components of the Between estimator, $b^*_B$, presented in Section 8.3.1.
To summarize Mundlak's results:
• The standard LSDV estimator for $\beta$ in the FE model (equation (8.2.3)) is the FGLS-RE estimator for $\beta$ in the general RE model (8.5.1).
• The standard FGLS-RE estimator in the RE model (equation (8.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model (8.5.1) with the constraints $\pi = 0$.
Therefore, the validity of the RE model can be tested by applying a standard Wald test of joint significance for the null hypothesis that $\pi = 0$ in the context of Mundlak's equation (8.5.1):
\[
M = (b_{LSDV} - b_B)'\,\widehat{\mathrm{Avar}}(b_{LSDV} - b_B)^{-1}(b_{LSDV} - b_B).
\]
Under $H_0: \pi = 0$, $M \to_d \chi^2(k)$.

Hausman and Taylor (1981) prove that the statistics $H$ and $M$ are numerically identical (for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial correlation) using the standard robustness options available for regression commands in most econometric packages. In addition, it is relatively easy to detect and solve singularity problems in the context of regression-based tests.
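A minimal sketch of the regression-based (Mundlak) version, made robust to violations of RE.3 by clustering at the individual level; variable names are hypothetical and one group mean is added for every time-varying regressor:

bysort id: egen x1bar = mean(x1)
bysort id: egen x2bar = mean(x2)
xtreg y x1 x2 x1bar x2bar, re vce(cluster id)
test x1bar x2bar        // H0: pi = 0, i.e. the RE specification is adequate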
8.5.3. Stata implementation. The Stata implementation of most results in this section
is demonstrated through a Stata do file available on the course learning space.
8.6. Large-sample results for the LSDV estimator
8.6.1. Introduction. This section proves consistency and asymptotic normality of the
LSDV estimator, then describes the heteroskedasticity and within-group serial correlation
consistent covariance estimator and finally provides a remark for practitioners.
Notation is standard. $X$ denotes the $(NT\times k)$ regressor matrix (of all time-varying regressors) and is partitioned by stacking individuals
\[
X = \begin{pmatrix} X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix}, \tag{8.6.1}
\]
with $X_i$ indicating the $(T\times k)$ block of observations peculiar to individual $i = 1,\dots,N$. Similarly, observations in the $(NT\times 1)$ vectors $y$ and $\varepsilon$ are stacked by individuals.

The projection matrix $M_{[D]}$ projects onto the space orthogonal to that spanned by the columns of the individual-dummies matrix $D$, and any conformable vector post-multiplying it gets transformed into group-mean deviations. It is not hard to see that $M_{[D]}$ is a block-diagonal matrix, with blocks all equal to
\[
M_{[1_T]} = I_T - \frac{1_T 1_T'}{T}.
\]
So,
\[
M_{[D]} = \begin{bmatrix}
M_{[1_T]} & 0 & \cdots & 0\\
0 & M_{[1_T]} & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & M_{[1_T]}
\end{bmatrix}. \tag{8.6.2}
\]
The LSDV estimator for the coefficients on the $X$ regressors, $\beta$, is given by
\[
b_{LSDV} = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y.
\]
Strict exogeneity is maintained throughout:

SE: $E(\varepsilon|X) = 0$.

The following random sampling assumption is invoked for the asymptotic normality of $b_{LSDV}$ and the consistency of the $b_{LSDV}$ asymptotic covariance estimator:

RS: There is a sample of size $n = NT$ such that the elements of the sequence $\{(y_i\ X_i),\ i = 1,\dots,N\}$ are independent (NB not necessarily identically distributed) random matrices.

8.6.2. Large-sample properties of LSDV. Let $\mathrm{Var}(\varepsilon|X) = \Sigma$, where $\Sigma$ is an arbitrary and unknown p.d. matrix.

8.6.3. Consistency. The following assumptions hold.

LSDV.1: $\operatorname{plim}_{N\to\infty}\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) = \lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) \equiv Q_\Sigma$, a positive definite and finite matrix.

LSDV.2: $\operatorname{plim}_{N\to\infty}\left(\frac{X'M_{[D]}X}{N}\right) = Q$, a positive definite and finite matrix.

Exercise 64. (This has been done in class) Prove that under LSDV.1 and LSDV.2, $\operatorname{plim}_{N\to\infty} b_{LSDV} = \beta$.
8.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS
and the following
LSDV.3: $\mathrm{Var}(\varepsilon|X) = \Sigma$, where
\[
\Sigma = \begin{bmatrix}
\Sigma_1 & 0 & \cdots & 0\\
0 & \Sigma_2 & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & \Sigma_N
\end{bmatrix}
\]
is a block-diagonal $(NT\times NT)$ positive definite matrix. Notice that the blocks of $\Sigma$ are arbitrary and heterogeneous, so that both arbitrary correlation across the time observations of the same individual (referred to as within-group serial correlation) and heteroskedasticity across individuals and over time are permitted. What is not permitted by the block-diagonal structure is correlation of the $\varepsilon$ realizations across different individuals.
Now focus on the generic individual $i = 1,\dots,N$ and notice that, given the block-diagonal form of $M_{[D]}$ in (8.6.2),
\[
M_{[D]}X = \begin{bmatrix}
M_{[1_T]} & 0 & \cdots & 0\\
0 & M_{[1_T]} & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
0 & \cdots & 0 & M_{[1_T]}
\end{bmatrix}\begin{pmatrix}X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix} = \begin{pmatrix}M_{[1_T]}X_1\\ \vdots\\ M_{[1_T]}X_i\\ \vdots\\ M_{[1_T]}X_N\end{pmatrix}.
\]
The proof of asymptotic normality for $b_{LSDV}$ parallels that in Section 7.2.2, with the only difference that the random objects we now start from are not $k\times 1$ vectors at the observation level but $k\times 1$ vectors at the individual level, $X_i'M_{[1_T]}\varepsilon_i$, $i = 1,\dots,N$.
First, by strict exogeneity $E\left(X_i'M_{[1_T]}\varepsilon_i\right) = 0$ and hence
\[
\mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\mid X\right) = E\left(X_i'M_{[1_T]}\varepsilon_i\varepsilon_i'M_{[1_T]}X_i\mid X\right) = X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i,
\]
so that
\[
\mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = E\left(X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i\right).
\]
Then, averaging across individuals,
\[
\frac{1}{N}\sum_{i=1}^N \mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = \frac{1}{N}\sum_{i=1}^N E\left(X_i'M_{[1_T]}\Sigma_i M_{[1_T]}X_i\right) = E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right).
\]
Therefore,
\[
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N \mathrm{Var}\left(X_i'M_{[1_T]}\varepsilon_i\right) = \lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Sigma M_{[D]}X}{N}\right) \equiv Q_\Sigma,
\]
which is a finite matrix by assumption LSDV.1, so that the Lindeberg-Feller theorem applies to yield
\[
\frac{\sqrt N}{N}\sum_{i=1}^N X_i'M_{[1_T]}\varepsilon_i \equiv \frac{X'M_{[D]}\varepsilon}{\sqrt N} \to_d N(0,\ Q_\Sigma).
\]
Finally, since
\[
\sqrt N\,(b_{LSDV}-\beta) \equiv \left(\frac{X'M_{[D]}X}{N}\right)^{-1}\frac{X'M_{[D]}\varepsilon}{\sqrt N} \to_d Q^{-1}\frac{X'M_{[D]}\varepsilon}{\sqrt N},
\]
\[
\sqrt N\,(b_{LSDV}-\beta) \to_d N\left(0,\ Q^{-1}Q_\Sigma Q^{-1}\right)
\]
and the asymptotic covariance matrix of $b_{LSDV}$ is given by
\[
\mathrm{Avar}(b_{LSDV}) = \frac{1}{N}Q^{-1}Q_\Sigma Q^{-1}. \tag{8.6.3}
\]
8.7. A Robust covariance estimator
Arellano (1987) demonstrates that, given the $(T\times 1)$ LSDV residual vector
\[
e_{LSDV,i} = M_{[1_T]}y_i - M_{[1_T]}X_i b_{LSDV},
\]
$i = 1,\dots,N$, a consistent estimator for the asymptotic covariance matrix of $b_{LSDV}$ in equation (8.6.3) is given by the White estimator:
\[
\widehat{\mathrm{Avar}}(b_{LSDV}) = \left(X'M_{[D]}X\right)^{-1}X'M_{[D]}\hat\Sigma M_{[D]}X\left(X'M_{[D]}X\right)^{-1}, \tag{8.7.1}
\]
where $\hat\Sigma$ is a block-diagonal matrix with generic block given by $e_{LSDV,i}e_{LSDV,i}'$. More formally,
\[
\hat\Sigma = e_{LSDV}e_{LSDV}' \ast DD'.
\]
Remark 65. The estimator in (8.7.1) is robust to arbitrary heteroskedasticity and within-group serial correlation. Stock and Watson (2008) prove that in the LSDV model the White estimator correcting for heteroskedasticity only, where $\hat\Sigma$ is a diagonal matrix with generic element $e^2_{LSDV,it}$ (see the first formula of Section 9.6.1 in Greene (2008)), is inconsistent for $N\to\infty$. The crux of Stock and Watson's argument is essentially algebraic, in that demeaned residuals are correlated over time by construction and this correlation does not vanish as $N\to\infty$. The recommendation for practitioners is then to correct for both heteroskedasticity and within-group serial correlation using the estimator (8.7.1), which is not affected by the Stock and Watson critique.
Remark 66. In Stata the robust covariance matrix of LSDV is computed easily by using
the xtreg command with the options fe and vce(cluster id), where id is the name of the
individual categorical variable in your Stata data set.
A similar correction can be carried out for POLS and FGLS-RE. For POLS we have
\[
\widehat{\mathrm{Avar}}\left(b^*_{POLS}\right) = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\hat\Sigma X^*\left(X^{*\prime}X^*\right)^{-1},
\]
where
\[
\hat\Sigma = e_{POLS}e_{POLS}' \ast DD'
\]
and the POLS residual vector is defined as in equation (8.2.10), whereas for FGLS-RE we have
\[
\widehat{\mathrm{Avar}}\left(b^*_{FGLS-RE}\right) = \left(X^{*\prime}\hat\Omega^{-1}X^*\right)^{-1}X^{*\prime}\hat\Omega^{-1/2}\hat\Sigma\,\hat\Omega^{-1/2}X^*\left(X^{*\prime}\hat\Omega^{-1}X^*\right)^{-1},
\]
where
\[
\hat\Sigma = e_{FGLS-RE}e_{FGLS-RE}' \ast DD'
\]
and the FGLS-RE residual vector is defined as
\[
e_{FGLS-RE} = y - X^*b^*_{FGLS-RE}.
\]
Remark 67. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE are estimated by using, respectively, the regress and the xtreg, re commands, both with the option vce(cluster id), as in the LSDV case.
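For concreteness, the three robust variants side by side (hypothetical variable names):

xtreg   y x1 x2, fe vce(cluster id)    // LSDV with the Arellano estimator (Remark 66)
xtreg   y x1 x2, re vce(cluster id)    // FGLS-RE, cluster-robust
regress y x1 x2, vce(cluster id)       // pooled OLS, cluster-robust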
8.8. Unbalanced panels
All of the methods so far have been described with a balanced panel data set in mind, but
nothing prevents applying the same methods to unbalanced panels (different numbers of time
observations across individuals).
Unbalanced panels only require a slight change in notation. As always we index individuals
by i = 1, ..., N , but now the size of each individual cluster, or group, of observations varies
across individuals and so the time index is t = 1, ..., Ti. This implies the following three facts.
(1) As in balanced panels, each observation in the data is uniquely identified by the two indexes: the pair $(i,t)$ identifies the $t$-th observation of the $i$-th individual.
(2) Differently from balanced panels, the group size, $T_i$, is no longer constant across clusters.
(3) Differently from balanced panels, the sample size is $n = \sum_{i=1}^N T_i$.
The LSDV estimator is implemented without any problem either by creating individual dummies or by taking variables in group-mean deviations, where the group means are at the individual level. The random effect estimator requires only some algebraic modifications in the formulas to allow for unbalancedness. The Arellano estimator also requires simple modifications in notation to accommodate unbalancedness: there is now a $(T_i\times 1)$ LSDV residual vector given by
\[
e_{LSDV,i} = M_{[1_{T_i}]}y_i - M_{[1_{T_i}]}X_i b_{LSDV},
\]
$i = 1,\dots,N$, and so the matrix $\hat\Sigma$ in (8.7.1) is a block-diagonal matrix with blocks that are now of different sizes. The notation using the Hadamard product does not, instead, require any adjustment for unbalancedness.
CHAPTER 9
Robust inference with cluster samplings
9.1. Introduction
The panel-data sets considered in these notes, with a large individual dimension and a small time dimension, are an example of one-way clustering. If the data-set is balanced, there are $n = NT$ observations clustered into $N$ individual groups, each comprising $T$ observations. If the data-set is unbalanced, as is often the case with real-world panels, there are $n = \sum_{i=1}^N T_i$ observations clustered into $N$ individual groups, each comprising $T_i$ observations, $i = 1,\dots,N$.
One-way clusterings can be observed also in cross-section data. Think for example of a
large sample of students clustered into many schools. The data structure parallels exactly
that of an unbalanced panel, just index schools by i = 1, ..., N and students within schools
by t = 1, ..., Ti. So, any observation in the data is uniquely identified by the values of i and
t. In other words, observation (i, t) refers to the t.th student in the i.th school. Therefore,
under random sampling of schools and arbitrary sampling of students within schools all of
the statistical methods described in chapter 8 can be conveniently used. This means that
pooled OLS, fixed and random effect estimators can be applied to clustered cross-sections.
The F-test on individual effects can be used to gauge the presence of latent heterogeneity.
The robust Hausman test can be used to discriminate between fixed and random effects and,
importantly, the White-Arellano estimator described in Section 8.7 can be used for computing
robust standard errors. For more on the parallel between unbalanced panel data and one-way
clustered cross sections see chapter 20 in Wooldridge (2010).
Dealing with one-way clustering is an important advance in econometrics. It is often the
case, however, that real-world economic data have a multi-dimensional structure, so clustering
can occur along more than one dimension. In a student survey, for example, there could be an
additional level of clustering given by teachers, or classes, within schools. Similarly, patients
can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals.
In a cross-sectional data-set of bilateral trade-flows, the cross-sectional units are the pairs of
countries and these are naturally clustered along two dimensions: the first and the second
country in the pair (Cameron et al., 2011). ln matched-employers-employees data there is the
worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).
Is it possible to do inference that is robust to multi-way clustering as we do inference that is
robust to one-way clustering? A recent paper by Cameron et al. (2011) offers a computationally
simple solution extending the White estimator to multi-way contexts. In essence, their method
boils down to computing a number of one-way robust covariance estimators, that are then
combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial
for the accuracy of the multi-way estimator that the one-way estimators be also accurate,
and so that the data-set has dimensions with a large number of clusters. Such an asymptotic requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both individual- and time-clustering in the typical micro-econometric panel data set, where T is fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator has good finite-sample properties in data-sets with dimensions of 100 clusters.

To illustrate the method I focus on two-way clustering, using a notation that is close to that in Cameron et al. (2011).
9.2. Two-way clustering
Notation is general enough to embrace cases in which cluster affiliations are not sufficient to uniquely identify an observation. There is a data-set with $n$ observations indexed by $i \in \{1,\dots,n\}$. Observations are clustered along two dimensions, $g \in \{1,\dots,G\}$ and $h \in \{1,\dots,H\}$. Asymptotics is for both $G$ and $H \to\infty$. The data-sets that I have in mind are, for example,
• surveys of students with at least moderately large numbers of teachers and schools;
• surveys of patients with at least moderately large numbers of doctors and hospitals;
• bilateral trade-flow data with at least a moderately large number of countries;
• matched employer-employee data with at least moderately large numbers of firms and workers.
For each dimension, it is known to which cluster a given observation $i = 1,\dots,n$ belongs. This information is contained in the mappings $g: \{1,\dots,n\} \to \{1,\dots,G\}$,
\[
g(i) = \left[g \in \{1,\dots,G\}:\ \text{observation } i \text{ belongs to cluster } g\right],\quad i = 1,\dots,n,
\]
and $h: \{1,\dots,n\} \to \{1,\dots,H\}$,
\[
h(i) = \left[h \in \{1,\dots,H\}:\ \text{observation } i \text{ belongs to cluster } h\right],\quad i = 1,\dots,n.
\]
From the mappings $g$ and $h$ we can also construct the $n\times G$ dummy-variable matrix $D^G$ and the $n\times H$ dummy-variable matrix $D^H$, as the following definition indicates.

Definition 68. Let
\[
d_{ig} = \begin{cases}1 & \text{if } g(i) = g\\ 0 & \text{else,}\end{cases}
\]
$i \in \{1,\dots,n\}$, $g \in \{1,\dots,G\}$, and
\[
d_{ih} = \begin{cases}1 & \text{if } h(i) = h\\ 0 & \text{else,}\end{cases}
\]
$i \in \{1,\dots,n\}$, $h \in \{1,\dots,H\}$. Then, $D^G$ and $D^H$ are the $n\times G$ and $n\times H$ matrices with $(i,g)$ element $d_{ig}$ and $(i,h)$ element $d_{ih}$, respectively.

Given $g$ and $h$, we can define an intersection dimension, say $G\cap H$, such that each cluster in $G\cap H$ contains only observations that belong to one unique cluster in $\{1,\dots,G\}$ and one unique cluster in $\{1,\dots,H\}$. This yields the matrix of dummy variables $D^{G\cap H}$. By construction, the number of clusters in the $G\cap H$ dimension is at most $G\times H$. For example, if
\[
D^G = \begin{pmatrix}
1 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{pmatrix},\qquad
D^H = \begin{pmatrix}
1 & 0\\
1 & 0\\
1 & 0\\
0 & 1\\
0 & 1\\
1 & 0
\end{pmatrix},
\]
then
\[
D^{G\cap H} = \begin{pmatrix}
1 & 0 & 0 & 0\\
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}.
\]
This framework allows that, in a survey of patients, for example, there could be more than one patient admitted to the same hospital and under the assistance of the same doctor. Or, similarly, that in a panel data set matching workers with firms the same worker can move across firms over time or that, conversely, the same firm may employ different workers over time.
Then, define three $n \times n$ indicator matrices: $S_G = D_G D_G'$, $S_H = D_H D_H'$ and $S_{G \cap H} = D_{G \cap H} D_{G \cap H}'$. It is easy to verify that:
• $S_G$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $g$ in $\{1, ..., G\}$; zero otherwise.
• $S_H$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $h$ in $\{1, ..., H\}$; zero otherwise.
• $S_{G \cap H}$ has $ij$th entry equal to one if observations $i$ and $j$ share any cluster $g$ in $\{1, ..., G\}$ and any cluster $h$ in $\{1, ..., H\}$; zero otherwise.
Also, the $ii$th entries in $S_G$, $S_H$ and $S_{G \cap H}$ equal one for all $i = 1, ..., n$, so the three indicator matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering,
$$y_i = \mathbf{x}_i'\beta + \varepsilon_i,$$
$i = 1, ..., n$, and let
$$\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Assumptions LRM.1-LRM.3 hold. Assumption LRM.4 is here replaced with a more general one permitting arbitrary heteroskedasticity and maintaining zero correlation only between errors peculiar to observations that share no cluster in common. For example, the latent error of patient $i$ is not correlated with the latent error of patient $j$ only if the two subjects are under the assistance of different doctors, say $g(i) \neq g(j)$, and in different hospitals, $h(i) \neq h(j)$. Formally,
LRM.4b: $Var(\varepsilon|X) \equiv \Sigma = E(\varepsilon\varepsilon'|X)$ with $E(\varepsilon_i\varepsilon_j|X) = 0$ unless $g(i) = g(j)$ or $h(i) = h(j)$, $i, j = 1, ..., n$.
Importantly, LRM.4b can equivalently be expressed as
(9.2.1) $\Sigma = E(\varepsilon\varepsilon' \ast S_G|X) + E(\varepsilon\varepsilon' \ast S_H|X) - E(\varepsilon\varepsilon' \ast S_{G \cap H}|X)$,
where the symbol $\ast$ stands for the element-by-element matrix product (also known as the Hadamard product) between matrices of equal dimension (verify the equivalence of LRM.4b and (9.2.1)).
As we know, OLS is in this case consistent and unbiased but not efficient. More importantly, OLS standard errors are biased, and so we need a two-way robust covariance estimator for inference. The covariance estimator devised by Cameron et al. (2011) is the combination of three one-way covariance estimators a la White. It is constructed along the following steps.
Carry out OLS and obtain the OLS residuals
$$e_{i,g(i),h(i)} = y_{i,g(i),h(i)} - \mathbf{x}_{i,g(i),h(i)}'b,$$
$i = 1, ..., n$, and stack them into the $n \times 1$ column vector
$$e = \begin{pmatrix} e_{1,g(1),h(1)} \\ \vdots \\ e_{i,g(i),h(i)} \\ \vdots \\ e_{n,g(n),h(n)} \end{pmatrix}.$$
The first one-way covariance estimator is
$$\widehat{Avar}(b)_G = (X'X)^{-1} X'\hat{\Sigma}_G X (X'X)^{-1},$$
where $\hat{\Sigma}_G = ee' \ast S_G$. $\widehat{Avar}(b)_G$ is a White estimator that is robust to clustering only along the $G$ dimension.
The second one-way covariance estimator is
$$\widehat{Avar}(b)_H = (X'X)^{-1} X'\hat{\Sigma}_H X (X'X)^{-1},$$
where $\hat{\Sigma}_H = ee' \ast S_H$. $\widehat{Avar}(b)_H$ is a White estimator that is robust to clustering only along the $H$ dimension.
The third one-way covariance estimator is
$$\widehat{Avar}(b)_{G \cap H} = (X'X)^{-1} X'\hat{\Sigma}_{G \cap H} X (X'X)^{-1},$$
where $\hat{\Sigma}_{G \cap H} = ee' \ast S_{G \cap H}$. $\widehat{Avar}(b)_{G \cap H}$ is a White estimator that is robust to clustering only along the $G \cap H$ dimension.
Finally, the two-way robust covariance estimator is
(9.2.2) $\widehat{Avar}(b) = \widehat{Avar}(b)_G + \widehat{Avar}(b)_H - \widehat{Avar}(b)_{G \cap H}$.
$\widehat{Avar}(b)$ is robust to clustering along both the $G$ and $H$ dimensions and is the estimator that is used to construct our robust tests.
Remark 69. Writing $\widehat{Avar}(b)$ as
$$\widehat{Avar}(b) = (X'X)^{-1} X'\left(\hat{\Sigma}_G + \hat{\Sigma}_H - \hat{\Sigma}_{G \cap H}\right) X (X'X)^{-1}$$
and then considering equation (9.2.1) uncovers the analogy principle on which the two-way robust covariance estimator rests.
Remark 70. Cameron et al. (2011) also present a general multi-way version of $\widehat{Avar}(b)$, which is derived from a simple extension of the foregoing analysis. The additional cost is only in terms of a more cumbersome notation. For the formulas I refer you to that paper.
9.3. Stata implementation
While there is no official command for the two-way $\widehat{Avar}(b)$ in Stata, it can be simply implemented by means of three one-way OLS regressions. Suppose that in our data set the two categorical variables for dimensions $G$ and $H$ are called doctor and hospital. You can assemble $\widehat{Avar}(b)$ along the following steps.
(1) Create the categorical variable for the intersection dimension, $G \cap H$, through the following instruction
egen doc_hosp = group(doctor hospital)
where doc_hosp is a name of choice.
(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix V_robust=V_d+V_h-V_dh (V_robust is a name of choice). To see the content of V_robust do: matrix list V_robust. The robust standard errors are just the square roots of the main diagonal elements in V_robust. A complete do-file sketch putting these steps together is given below.
1 It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).
CHAPTER 10
Issues in linear IV and GMM estimation
10.1. Introduction
The conditional mean independence of $\varepsilon$ and $\mathbf{x}$ maintained by P.2 (Section 2.1) often fails in economic structures, where some of the $\mathbf{x}$ variables are chosen by the economic subjects and as such may depend on the latent factors at the equilibrium. These $\mathbf{x}$ variables are said to be endogenous.
In economics, think of a production function, where (some of) the observable input quantities are under the firm's control. The same consideration holds for the education variable in a wage equation. These are all cases of omitted variable bias (Section 4.7), which makes standard estimation techniques unusable.
As we have seen in Section 4.7.1, the proxy variables solution maintains that there is information external to the model that is able to fully explain the correlation between observed and unobserved variables. For example, observed IQ scores, clearly redundant in a wage equation with latent ability, are an imperfect measure of latent ability, but the discrepancy between the two variables is likely to be unrelated to individual education levels. Such information, so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individuals and/or over time and a panel data set is available, the endogeneity problem is solved by applying the panel-data methods introduced in Chapter 8. But panel data are not always available, and even when they are, the disturbing omitted factors may not meet the time-constancy requirement. For example, idiosyncratic productivity shocks may well be related to input factors in the estimation of a production function.
Neither proxy variables nor panel data methods are generally usable when endogeneity springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise variable as a determinant of individual health, hinting at an endogeneity bias due to reverse causality. If the exercise activity is indeed affected by the health status, exercise would depend on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized Method of Moments (GMM) estimators offer a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity problem in two stages. The first stage attempts to identify the exogenous-variation components of the $\mathbf{x}$'s through a set of exogenous variables, some of which are external to the model, called instrumental variables. The second stage applies regression analysis using only the first-stage exogenous components as explanatory variables. IV and GMM methods are preferred tools of econometric analysis, compared to alternative techniques, since often the first stage can be justified on the ground of economic theory.
There are various IV and GMM applications demonstrating the methods of this chapter: iv.do using mus06data.dta, IV_GMM_panel.do using costfn.dta, and IV_GMM_DPD.do and abest.do, both using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.
10.2. The method of moments
The method of moments estimates the parameters of interest by replacing population moment conditions with their sample analogs. Almost all popular estimators can be thought of as method of moments estimators. Two examples follow.
10.2.1. The linear regression model. Consider the linear model of Chapter 1 and the system of moment conditions (1.2.3)
$$E(\mathbf{x}y) = E(\mathbf{x}\mathbf{x}')\beta.$$
So, the true coefficient vector, $\beta$, solves the population moment conditions and is equal to $\beta = E(\mathbf{x}\mathbf{x}')^{-1}E(\mathbf{x}y)$. By the analogy principle a consistent estimator for $\beta$, $\hat{\beta}^{*}$, will satisfy the system of $k$ analog sample moment conditions:
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) = 0.$$
Hence,
$$\hat{\beta}^{*} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n}\mathbf{x}_i y_i = (X'X)^{-1}X'y,$$
which is exactly the OLS estimator.
10.2.2. The Instrumental Variable (IV) regression model in the just-identified case. Consider the linear model of Chapter 1 but without assumption P.3, $E(\varepsilon|\mathbf{x}) = 0$, or even the weaker P.3b, $E(\mathbf{x}\varepsilon) = 0$. This means that some of the variables in $\mathbf{x}$ are potentially endogenous, that is, related in some way to $\varepsilon$. Assume, instead, conditional mean independence for an $L \times 1$ vector of variables $\mathbf{z}$, that is $E(\varepsilon|\mathbf{z}) = 0$, with $L = k$. The vector $\mathbf{z}$ is generally different from $\mathbf{x}$; if it is not, then we are back to the classical regression model and there is no endogeneity problem. Then, as before, using the law of iterated expectations we have
$$E(\mathbf{z}\varepsilon) = E_{\mathbf{z}}\left[E(\mathbf{z}\varepsilon|\mathbf{z})\right] = E_{\mathbf{z}}\left[\mathbf{z}E(\varepsilon|\mathbf{z})\right] = 0.$$
So, there are $k$ moment conditions in the population
$$E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right] = 0,$$
or equivalently
$$E(\mathbf{z}y) = E(\mathbf{z}\mathbf{x}')\beta.$$
So, the true coefficient vector, $\beta$, solves the population moment conditions and is equal to $\beta = E(\mathbf{z}\mathbf{x}')^{-1}E(\mathbf{z}y)$. By the analogy principle a consistent estimator for $\beta$, $\hat{\beta}^{**}$, will satisfy the system of $k$ analog sample moment conditions:
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{z}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) = 0.$$
Hence,
$$\hat{\beta}^{**} = \left(\sum_{i=1}^{n}\mathbf{z}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n}\mathbf{z}_i y_i = (Z'X)^{-1}Z'y,$$
which is the classical IV estimator.
The intuition is straightforward: since the true coefficients solve the population moment conditions, if the sample moments provide good estimates for the population moments, then one might expect that the estimator solving the sample moment conditions will provide good estimates of the true coefficients.
What if there are more moment conditions than unknown parameters, that is, if $L > k$? Then we turn to GMM estimation.
10.2.3. The Generalized Method of Moments. GMM estimation is general: it can be applied to both linear and non-linear models and in the over-identified case $L > k$. To see this, define the column vector of observables in the population $\mathbf{w} \equiv (y\ \mathbf{x}'\ \mathbf{z}')'$. There are $L > k$ population moments collected into the $(L \times 1)$ column vector $m(\beta)$:
$$m(\beta) \equiv E\left[f(\mathbf{w},\beta)\right],$$
and suppose that the following population moment conditions hold
$$m(\beta) = 0.$$
Now consider the $L$ sample moments
$$m(\hat{\beta}) \equiv \frac{1}{n}\sum_{i=1}^{n} f(\mathbf{w}_i,\hat{\beta})$$
and the $L$ sample moment conditions
$$m(\hat{\beta}) = 0.$$
Hence there are $L$ equations and $k$ unknowns, so that no estimator $\hat{\beta}$ can solve the system of sample moment conditions exactly. Instead, there exists a $\hat{\beta}$ that can make $m(\hat{\beta})$ as close to zero as possible:
$$\hat{\beta}_{GMM} = \arg\min_{\hat{\beta}} Q(\hat{\beta}),$$
where $Q(\hat{\beta}) \equiv m(\hat{\beta})'A\,m(\hat{\beta})$ is a quadratic criterion function of the sample moments and $A$ is a positive definite matrix weighting the squares and the cross-products of the sample moments in $Q(\hat{\beta})$.
Note that $Q(\hat{\beta}) \geq 0$ and, since $A$ is positive definite, $Q(\hat{\beta}) = 0$ only if $m(\hat{\beta}) = 0$. Thus, $Q(\hat{\beta})$ can be made exactly zero in the just-identified case and is strictly greater than zero in the over-identified case.
10.2.4. The TSLS estimator. It is not hard to prove that the well-known Two Stage Least Squares (TSLS) estimator in the over-identified linear model belongs to the class of GMM estimators. Consider the linear regression model of Section 10.2.2 with $L > k$ instruments. Then, there are the following population moments
$$m(\beta) \equiv E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right]$$
and population moment conditions
$$E\left[\mathbf{z}\left(y - \mathbf{x}'\beta\right)\right] = 0.$$
The $L$ sample moments are collected into the $(L \times 1)$ vector $m(\hat{\beta})$,
$$m(\hat{\beta}) \equiv \frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_i\left(y_i - \mathbf{x}_i'\hat{\beta}\right) \equiv \frac{1}{n}Z'\left(y - X\hat{\beta}\right).$$
Suppose we choose a quadratic criterion function with the following weighting matrix
$$A \equiv \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_i\mathbf{z}_i'\right)^{-1} = n\left(Z'Z\right)^{-1}.$$
Then
$$Q(\hat{\beta}) \equiv \frac{1}{n}\left(y - X\hat{\beta}\right)'Z\left(Z'Z\right)^{-1}Z'\left(y - X\hat{\beta}\right),$$
with the following normal equations for the minimization problem:
$$\frac{\partial Q(\hat{\beta})}{\partial\hat{\beta}} \equiv -\frac{2}{n}X'Z\left(Z'Z\right)^{-1}Z'\left(y - X\hat{\beta}\right) = 0,$$
which, solved for $\hat{\beta}$, yield the TSLS estimator
$$\hat{\beta}_{TSLS} \equiv \left(X'Z\left(Z'Z\right)^{-1}Z'X\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y,$$
or, more compactly,
$$\hat{\beta}_{TSLS} \equiv \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}y.$$
The estimator's name derives from the fact that it is computed in two stages:
(1) Regress each column of $X$ on $Z$ using OLS to obtain the OLS fitted values of $X$: $Z(Z'Z)^{-1}Z'X = P_{[Z]}X$. Thus, $X = P_{[Z]}X + M_{[Z]}X$, where $P_{[Z]}X$ is an approximately exogenous component, whose covariance with $\varepsilon$ goes to zero as $n \to \infty$, and $M_{[Z]}X$ is a residual, potentially endogenous, component. Only $P_{[Z]}X$ is used in the second stage.
(2) Regress $y$ on the fitted values, $P_{[Z]}X$, to obtain TSLS.
If the population moment conditions are true, then $Q(\hat{\beta}_{TSLS})$ should not be significantly different from zero. This provides a test for the validity of the $L - k$ over-identifying moment conditions based on the following statistic (Hansen-Sargan test)
$$S \equiv nQ(\hat{\beta}_{TSLS}) \sim \chi^2(L-k).$$
Exercise 71. Prove that if $L = k$, TSLS collapses to IV.
Solution: $Z'X$ is invertible, so
$$\hat{\beta}_{TSLS} \equiv \left(X'Z\left(Z'Z\right)^{-1}Z'X\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y = \left(Z'X\right)^{-1}Z'Z\left(X'Z\right)^{-1}X'Z\left(Z'Z\right)^{-1}Z'y = \left(Z'X\right)^{-1}Z'y.$$
10.3. Stata implementation of the TSLS estimator
It is implemented by the command ivregress 2sls followed by the names of the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side endogenous variables and the excluded exogenous variables (the instruments), as follows
ivregress 2sls depvar indepvars (endog_vars = instruments), options
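For instance, a minimal sketch in the spirit of iv.do; the variable names (outcome y, exogenous x1 x2, endogenous d, instruments z1 z2) are hypothetical placeholders rather than the actual variables of mus06data.dta:

ivregress 2sls y x1 x2 (d = z1 z2), vce(robust)
estat firststage     // first-stage and weak-instrument diagnostics (Section 10.8)
estat endogenous     // Durbin-Wu-Hausman exogeneity test (Section 10.6)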
10.4. Stata implementation of the (linear) GMM estimator
It is implemented by ivregress gmm followed by the names of the dependent variable, the included exogenous variables and, within parentheses, all the right-hand-side endogenous variables and the excluded exogenous variables (the instruments), as follows
ivregress gmm depvar indepvars (endog_vars = instruments), options
10.4.1. Choosing the weighting matrix. The weighting matrix in the optimal two-step GMM estimator is
(10.4.1) $A = \left(Z'\hat{\Sigma}Z/n\right)^{-1}$
(see Hansen 1982). $A$ is a consistent estimate of the inverse of $Var\left[\mathbf{z}_i\left(y_i - \mathbf{x}_i'\beta\right)\right]$.
Choice of $\hat{\Sigma}$:
• If $\varepsilon$ is homoskedastic and independent, then $\hat{\Sigma} = I$ (the resulting GMM estimator collapses to TSLS). It's implemented through the ivregress gmm option wmatrix(unadjusted).
• If $\varepsilon$ is heteroskedastic and independent, then $\hat{\Sigma}$ is a diagonal matrix with generic diagonal element equal to the squared residual from some one-step consistent estimator, TSLS for example:
$$\hat{\Sigma} = \begin{pmatrix} e_1^2 & 0 & \cdots & 0 \\ 0 & e_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & e_n^2 \end{pmatrix},$$
with $e_i = y_i - \mathbf{x}_i'\hat{\beta}_{TSLS}$, $i = 1, ..., n$. It's implemented through the ivregress gmm option wmatrix(robust). It's the default.
• If errors are clustered, with $N$ clusters, then $\hat{\Sigma}$ is a block-diagonal matrix with generic block equal to the outer product of the residuals peculiar to the corresponding cluster. Again, residuals are taken from a one-step consistent regression (TSLS):
$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_1 & 0 & \cdots & 0 \\ 0 & \hat{\Sigma}_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \hat{\Sigma}_N \end{pmatrix},$$
with $\hat{\Sigma}_i = e_ie_i'$, $i = 1, ..., N$. Notice that in this case $e_i = y_i - \mathbf{x}_i'\hat{\beta}_{TSLS}$ is a vector and not a scalar: it is the vector of residual observations peculiar to cluster $i = 1, ..., N$. It's implemented through the ivregress gmm option wmatrix(cluster cluster_var). (A short sketch of these options follows the list.)
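For instance (same hypothetical variable names as in the TSLS sketch above, with cvar a hypothetical cluster variable), the three choices map into:

ivregress gmm y x1 x2 (d = z1 z2), wmatrix(unadjusted)    // homoskedastic, independent errors
ivregress gmm y x1 x2 (d = z1 z2), wmatrix(robust)        // heteroskedastic errors (the default)
ivregress gmm y x1 x2 (d = z1 z2), wmatrix(cluster cvar)  // errors clustered on cvar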
10.4.2. Iterative GMM. The GMM procedure can be iterated by adding the option
igmm. The resulting estimator is asymptotically equivalent to the two-step estimator. However,
Hall (2005) suggests that it may have a better finite-sample performance.
10.5. Robust Variance Estimators
The less efficient, but computationally simpler and still consistent, TSLS estimator is often used in estimation. Its robust variance-covariance matrix $Var(\hat{\beta}_{TSLS})$ is consistently estimated as
$$\widehat{Var}\left(\hat{\beta}_{TSLS}\right) = \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}\hat{\Sigma}P_{[Z]}X\left(X'P_{[Z]}X\right)^{-1},$$
where $\hat{\Sigma}$ is chosen according to the various departures from homoskedasticity and independence spelled out above. The Stata implementation of the three variance-covariance estimators is through the ivregress options vce(unadjusted), vce(robust) and vce(cluster cluster_var).
10.6. Durbin-Wu-Hausman Exogeneity test
A conventional Hausman test can always be implemented, based on the Hausman statistic measuring the statistical difference between IV and OLS estimates. It is not robust to heteroskedastic and clustered errors, though. Wu suggests an alternative. But before presenting it, consider this exercise, which will prove useful in the derivations below.
Exercise 72. Prove that the TSLS estimator for $\beta_2$ is
$$b_{2,TSLS} = \left(X_2'P_{[M_{[X_1]}Z_1]}X_2\right)^{-1}X_2'P_{[M_{[X_1]}Z_1]}y.$$
Solution. Applying the FWL Theorem to the second-stage regression,
$$b_{2,TSLS} = \left(X_2'P_{[Z]}M_{[P_{[Z]}X_1]}P_{[Z]}X_2\right)^{-1}X_2'P_{[Z]}M_{[P_{[Z]}X_1]}P_{[Z]}y.$$
By Lemma 12, $P_{[Z]} = P_{[X_1]} + P_{[M_{[X_1]}Z_1]}$, so that $P_{[Z]}X_1 = X_1$ and
$$b_{2,TSLS} = \left(X_2'P_{[Z]}M_{[X_1]}P_{[Z]}X_2\right)^{-1}X_2'P_{[Z]}M_{[X_1]}P_{[Z]}y.$$
But then $P_{[Z]} = P_{[X_1]} + P_{[M_{[X_1]}Z_1]}$ also assures that $P_{[Z]}M_{[X_1]} = P_{[M_{[X_1]}Z_1]}$, proving the result.
The DWH test provides a robust version of the Hausman test. It maintains instrument validity, $E(\varepsilon|Z) = 0$, and is based on the so-called control-function approach, which recasts the endogeneity problem as a misspecification problem affecting the structural equation
(10.6.1) $y = X\beta + \varepsilon$,
$X = (X_1\ X_2)$, $\beta = (\beta_1'\ \beta_2')'$, $Z = (X_1\ Z_1)$ and $\varepsilon = u + \nu\pi$. The component $u$ is such that $E(u|X) = 0$ and $\nu$ is the $n \times k_2$ matrix of the errors in the first-stage equations of the variables $X_2$. As such, $\nu$ is responsible for the endogeneity of $X_2$.
Replacing $\nu$ in (10.6.1) with the residuals from the first-stage regressions, $\nu = M_{[Z]}X_2$, makes the DWH test operational as a simple test of joint significance for $\pi$ in the auxiliary OLS regression
(10.6.2) $y = X\beta + M_{[Z]}X_2\pi + u^{*}$.
The test works well since, under the alternative of $\pi \neq 0$, OLS estimation of the auxiliary regression yields the TSLS estimators. This is proved as follows:
$$y = \left(P_{[Z]} + M_{[Z]}\right)X\beta + M_{[Z]}X_2\pi + u^{*}$$
and so
$$y = P_{[Z]}X\beta + M_{[Z]}X\beta + M_{[Z]}X_2\pi + u^{*},$$
but $M_{[Z]}X\beta = M_{[Z]}X_2\beta_2$ since by Lemma 12 $M_{[Z]} = M_{[X_1]} - P_{[M_{[X_1]}Z_1]}$. Therefore,
$$y = P_{[Z]}X\beta + M_{[Z]}X_2\left(\beta_2 + \pi\right) + u^{*},$$
and since $P_{[Z]}X$ and $M_{[Z]}X_2$ are orthogonal the FWL Theorem assures that the OLS estimator for $\beta$ is
$$b_{TSLS} = \left(X'P_{[Z]}X\right)^{-1}X'P_{[Z]}y,$$
and also
$$\widehat{\beta_2 + \pi} = \left(X_2'M_{[Z]}X_2\right)^{-1}X_2'M_{[Z]}y = \left(X_2'M_{[Z]}X_2\right)^{-1}\left(X_2'M_{[X_1]}y - X_2'P_{[M_{[X_1]}Z_1]}y\right) = K b_{2,OLS} + (I - K)b_{2,TSLS},$$
with $K \equiv \left(X_2'M_{[Z]}X_2\right)^{-1}\left(X_2'M_{[X_1]}X_2\right)$ and where the last equation follows from Exercise 72.
Therefore
$$\hat{\pi} = K b_{2,OLS} + (I - K)b_{2,TSLS} - b_{2,TSLS} = K\left(b_{2,OLS} - b_{2,TSLS}\right),$$
proving that the test indeed follows the Hausman test's general principle of assessing the distance between an efficient estimator and a consistent but inefficient estimator under the null hypothesis.
One great advantage of the DWH test over a conventional Hausman test is that it can
be easily robustified for heteroskedasticity and/or clustered errors by estimating (10.6.2) with
regress and a suitable robust option, vce(cluster clustervar) for example.
DWH can be immediately implemented in Stata through the ivregress postestimation
command estat endogenous.
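A minimal sketch of the robustified control-function version of the test (hypothetical variable names as before; cvar is a cluster variable):

quietly regress d x1 x2 z1 z2               // first-stage regression
predict v2hat, residuals                    // residuals M_[Z]X_2
regress y x1 x2 d v2hat, vce(cluster cvar)  // auxiliary regression (10.6.2)
test v2hat                                  // DWH test: joint significance of pi
* alternatively, after ivregress: estat endogenous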
10.7. Endogenous binary variables
The linear IV-GMM approach outlined so far fits the case of binary endogenous variables, producing consistent estimates. However, a first-stage regression fully accounting for the binary structure of the endogenous variables may provide considerable efficiency gains. The implied (non-linear) model is as follows:
$$y_i = \mathbf{x}_{1i}'\beta_1 + x_{2i}\beta_2 + \epsilon_i$$
$$x_{2i}^{*} = \mathbf{x}_{1i}'\pi_1 + \mathbf{z}_i'\pi_2 + \nu_i$$
$$x_{2i} = \begin{cases} 1 & \text{if } x_{2i}^{*} > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$(\epsilon_i, \nu_i) \sim N\left[0, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right].$$
It is estimated by the Stata procedure
treatreg depvar indepvars, treat(endog_var = instruments) other_options
through either ML (the default) or a consistent two-step procedure (twostep option).
10.8. Testing for weak instruments
• Staiger and Stock's rule of thumb: partial F tests in the first-stage regression > 10. It is not rigorous, tends to reject weak instruments too often and has no obvious implementation when there is more than one endogenous variable.
• Stock and Yogo's (2005) two tests overcome all of the above difficulties. They are both based on the minimum eigenvalue of the matrix analog of the partial F test, a statistic introduced by Cragg and Donald (1993) to test non-identification. Importantly, the large-sample properties of both tests have been derived under the assumption of homoskedastic and independent errors. Caution must be taken, then, when drawing conclusions from the tests if the errors are thought to depart from those hypotheses.
Both procedures are implemented by the ivregress postestimation command estat firststage.
10.9. Inference with weak instruments
Conditional inference on the coefficients of the endogenous variables in the presence of weak instruments is implemented through the command condivreg by Mikusheva and Poi (2006). The theory is reviewed and expanded in Andrews et al. (2007). The command produces three alternative confidence sets for the coefficient of the endogenous regressor, obtained from the conditional LR, Anderson-Rubin (option ar) and LM statistics (option lm). The syntax of condivreg is similar to that of ivregress.
10.9.1. Three-Stage Least Squares. It is a system estimator including structural equations for all endogenous variables. Identification is ensured by standard (sufficient) rank and (necessary) order conditions. It is seldom used, as it is inconsistent in the presence of heteroskedastic errors, which is the norm in most micro applications. The Stata command is reg3.
10.10. Dynamic panel data
Situations in which past decisions have an impact on current behaviour are ubiquitous in economics. For example, in the presence of input adjustment costs, short-run input demands depend also on past input levels. In such cases fitting a static model to the data will lead to what is referred to as dynamic underspecification. With a panel data set, however, it is possible to implement a dynamic model from the outset in order to describe the phenomena of interest.
To make things simple, let us get started with the simple autoregressive process
(10.10.1) $y_{it} = \alpha + \gamma y_{i,t-1} + \epsilon_{it}$,
$t = 1, ..., T$, $i = 1, ..., N$.
Model (10.10.1) can be easily extended to allow for time-invariant individual terms:
(10.10.2) $y_{it} = \gamma y_{i,t-1} + \alpha_i + \epsilon_{it}$,
$t = 1, ..., T$, $i = 1, ..., N$. In vector notation, stacking time observations for each individual,
$$y_i = \gamma y_{-1i} + \alpha_i 1_T + \epsilon_i,$$
$i = 1, ..., N$, where
$$\underset{(T\times 1)}{y_i} = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{it} \\ \vdots \\ y_{iT} \end{pmatrix}, \qquad \underset{(T\times 1)}{y_{-1i}} = \begin{pmatrix} y_{i,0} \\ \vdots \\ y_{i,t-1} \\ \vdots \\ y_{i,T-1} \end{pmatrix}, \qquad \underset{(T\times 1)}{\epsilon_i} = \begin{pmatrix} \epsilon_{i1} \\ \vdots \\ \epsilon_{it} \\ \vdots \\ \epsilon_{iT} \end{pmatrix}.$$
Notice that for each individual there are $T + 1$ observations available in the data set, from $y_{i0}$ to $y_{iT}$, but only $T$ are usable since one is lost to taking lags. The problem here is that $y_{-1i}$ is not strictly exogenous. Given (10.10.2), the $t$-th realization of $y_{-1i}$ is $y_{i,t-1} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{i,t-1})$ and so all future realizations of $y_{-1i}$, from $y_{i,t} = f(y_{i0}, \epsilon_{i1}, ..., \epsilon_{it})$ to $y_{i,T-1} = f(y_{i0}, \epsilon_{i1}, ..., \epsilon_{it}, ..., \epsilon_{i,T-1})$, depend on $\epsilon_{i,t}$, which makes $E(\epsilon_{i,t}|y_{i,0}, ..., y_{i,T-1}) = 0$ fail (exercise: can you work out the exact expression of the right-hand side of $y_{i,t-1} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{i,t-1})$?).
Nonetheless, we may maintain conditional mean independence between $\epsilon_{i,t}$ and the $t$-th and more remote values of $y_{-1i}$, say $y_i^{t-1} = (y_{i,t-1}, y_{i,t-2}, ..., y_{i,0})'$, using the notation in Arellano (2003). More formally, we maintain throughout (see Arellano (2003) for a discussion)
A.1: $E\left(\epsilon_{it}|y_i^{t-1}, \alpha_i\right) = 0$ for all $t = 1, ..., T$.
Assumption A.1 is also considered in Wooldridge (chapter 11, 2010), where it is referred to as sequential exogeneity conditional on the unobserved effect. It may be convenient sometimes to maintain also the following (sequential) conditional homoskedasticity assumption
A.2: $E\left(\epsilon_{it}^2|y_i^{t-1}, \alpha_i\right) = \sigma^2$ for all $t = 1, ..., T$.
It is not hard to prove that Equation (10.10.1) and Assumption A.1 imply the following (prove it using the LIE and $\epsilon_{i,t-j} = y_{i,t-j} - \gamma y_{i,t-j-1} - \alpha_i$)
A.3: $E\left(\epsilon_{it}\epsilon_{i,t-j}|y_i^{t-1}, \alpha_i\right) = 0$, for all $j = 1, ..., t-1$.
10.10.1. Inconsistency of the LSDV Estimator. Since $y_{-1i}$ is not strictly exogenous, the LSDV estimator, $\gamma_{LSDV}$, is inconsistent for $N \to \infty$. Nickell (1981) was the first to derive the inconsistency. Given
$$\gamma_{LSDV} = \gamma + \frac{\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right)}{\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)^2},$$
he showed that
$$\mathrm{plim}\,\frac{1}{NT}\sum_i\sum_t\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right) = E\left(\frac{1}{T}\sum_{t=1}^{T}\left(y_{i,t-1} - \bar{y}_{i\cdot,-1}\right)\left(\epsilon_{it} - \bar{\epsilon}_{i\cdot}\right)\right) = -\frac{1}{T^2}\,\frac{T - 1 - T\gamma + \gamma^T}{(1-\gamma)^2}\,\sigma_\epsilon^2 \neq 0.$$
Hence, the bias vanishes for $T \to \infty$, but it does not for $N \to \infty$ and $T$ fixed. For this reason, the LSDV estimator is inaccurate in panel data sets with large $N$ and small $T$ and is said to be semi-inconsistent (see also Sevestre and Trognon, 1996).
Since Nickell (1981) a number of consistent IV and GMM estimators have been proposed in the econometric literature as an alternative to LSDV. Anderson and Hsiao (1981) suggest two simple IV estimators that, upon transforming the model in first differences to eliminate the unobserved individual heterogeneity, use the second lags of the dependent variable, either differenced or in levels, as an instrument for the differenced one-time lagged dependent variable. Arellano and Bond (1991) propose a GMM estimator for the first-differenced model which, relying on all available lags of $y_{-1i}$ as instruments, is more efficient than Anderson and Hsiao's. Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses only linear moment restrictions, suggest a set of non-linear restrictions that may be used in addition to the linear ones to obtain more efficient estimates. Blundell and Bond (1998) observe that with highly persistent data first-differenced IV or GMM estimators may suffer from a severe small-sample bias due to weak instruments. As a solution, they suggest a system GMM estimator with first-differenced instruments for the equation in levels and instruments in levels for the first-differenced equation. Some of the foregoing methods are nowadays very popular and are surveyed below.
10.10.2. The Anderson and Hsiao IV Estimator. One typical solution is to take model (10.10.2) in first differences to eliminate the individual effects:
(10.10.3) $y_{it} - y_{i,t-1} = \gamma(y_{i,t-1} - y_{i,t-2}) + \epsilon_{it} - \epsilon_{i,t-1}$.
This makes the disturbances MA(1) with a unit root, and so induces correlation between the lagged endogenous variable and the disturbances. This problem can be solved by using instruments for $\triangle y_{i,t-1}$. Anderson and Hsiao (1982) suggest using either $y_{i,t-2}$ or $\triangle y_{i,t-2}$, since these are correlated with $\triangle y_{i,t-1}$ but are uncorrelated with $\epsilon_{it} - \epsilon_{i,t-1}$. It is an exactly identified IV estimator, consistent under Assumption A.1, but generally non-optimal and with a high root mean squared error in applications.
10.10.3. The Arellano and Bond GMM estimator. Arellano and Bond (1991) propose a more efficient estimator, using a larger set of instruments. Note that, given (10.10.2), $y_{i1} = f(y_{i0}, \epsilon_{i1})$, $y_{i2} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2})$, ..., $y_{it} = f(y_{i0}, \epsilon_{i1}, \epsilon_{i2}, ..., \epsilon_{it})$, and that after differencing the first usable period in the sample is $t = 2$:
$$y_{i2} - y_{i1} = \gamma(y_{i1} - y_{i0}) + \epsilon_{i2} - \epsilon_{i1}.$$
One finds that the value $y_{i0}$ is a valid instrument for $(y_{i1} - y_{i0})$: in fact, $y_{i0}$ is correlated with $(y_{i1} - y_{i0})$ and, given Assumption A.1, $E\left[y_{i0}(\epsilon_{i2} - \epsilon_{i1})\right] = 0$.
In the next period, $t = 3$, we have
$$y_{i3} - y_{i2} = \gamma(y_{i2} - y_{i1}) + \epsilon_{i3} - \epsilon_{i2},$$
and both $y_{i0}$ and $y_{i1}$ are valid instruments for $(y_{i2} - y_{i1})$, and so on.
This approach adds an extra valid instrument with each forward period, so that in the last period $T$ the instrument set is $(y_{i0}, y_{i1}, y_{i2}, ..., y_{i,T-2})$. For each individual $i$, the matrix of instruments is therefore
$$Z_i = \begin{pmatrix} y_{i0} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & y_{i0} & y_{i1} & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & y_{i0} & y_{i1} & y_{i2} & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & & & & & & \ddots & & & & & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & \cdots & y_{i0} & y_{i1} & y_{i2} & \cdots & y_{i,T-2} \end{pmatrix}.$$
Stacking individuals, the overall matrix of instrumental variables is
$$Z = \left[Z_1', Z_2', ..., Z_N'\right]',$$
and model (10.10.3) can be reformulated more compactly as
$$\triangle y = \gamma\triangle y_{-1} + \triangle\epsilon,$$
where
• $\triangle y$ is an $(N(T-1) \times 1)$ vector;
• $\triangle y_{-1}$ is an $(N(T-1) \times 1)$ vector;
• $\triangle\epsilon$ is an $(N(T-1) \times 1)$ vector.
The number of instruments is $L = T(T-1)/2$. So, $Z$ is an $(N(T-1) \times L)$ matrix. The instrumental variables satisfy, for each individual $i$, the $(L \times 1)$ vector of population moment conditions $m(\gamma) \equiv E\left[Z_i'\triangle\epsilon_i\right] = 0$, where $\triangle\epsilon_i$ stands for the $i$-th block of $\triangle\epsilon$. We can define the $(L \times 1)$ vector of sample moment conditions $m(\hat{\gamma}) = \frac{1}{N}Z'(\triangle\epsilon) = \frac{1}{N}Z'(\triangle y - \hat{\gamma}\triangle y_{-1})$. Since $L > 1$, this is an over-identified case and GMM estimation is needed.
If we also consider Assumption A.2, that is, $\epsilon$, beyond being not serially correlated, is also homoskedastic, the optimal GMM estimator can be obtained in one step. It minimizes the following criterion function
(10.10.4) $Q(\hat{\gamma}) = m(\hat{\gamma})'A\,m(\hat{\gamma})$,
where, according to what we have seen in Subsection 10.4.1, $A$ is a consistent estimator of the inverse of
(10.10.5) $Var\left(Z_i'\triangle\epsilon_i\right) = \sigma^2 E\left(Z_i'GZ_i\right)$
up to an irrelevant positive scalar, and
$$G = \begin{pmatrix} 2 & -1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 2 & -1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 2 & -1 & \cdots & \vdots & \vdots \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 2 & -1 \\ 0 & 0 & 0 & 0 & \cdots & -1 & 2 \end{pmatrix}.$$
The Arellano-Bond one-step estimator $\hat{\gamma}_1$ uses
$$A = \left(\frac{1}{N}\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}$$
in (10.10.4), and so
$$\hat{\gamma}_1 = \left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}Z'\triangle y_{-1}\right]^{-1}\times\left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'GZ_i\right)^{-1}Z'\triangle y\right].$$
Without homoskedasticity (that is, without Assumption A.2), $\hat{\gamma}_1$ is no longer optimal, but it remains consistent and as such can be used to construct the optimal two-step estimator $\hat{\gamma}_2$ along the lines described in Subsection 10.4.1. Therefore, $\hat{\gamma}_2$ minimizes (10.10.4) with
$$A = \left(\frac{1}{N}\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1},$$
where $\triangle e_{1i} = \triangle y_i - \hat{\gamma}_1\triangle y_{-1i}$ is the individual-level residual vector from the one-step estimator:
$$\hat{\gamma}_2 = \left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1}Z'\triangle y_{-1}\right]^{-1}\times\left[\triangle y_{-1}'Z\left(\sum_{i=1}^{N}Z_i'\triangle e_{1i}\triangle e_{1i}'Z_i\right)^{-1}Z'\triangle y\right].$$
If the $\epsilon_{it}$ are iid$(0,\sigma_\epsilon^2)$, $\hat{\gamma}_1$ and $\hat{\gamma}_2$ are asymptotically equivalent.
To test instrument validity one can apply the Hansen-Sargan test of overidentifying restrictions:
$$S = \left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\right)'\left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\triangle e_{2i}'Z_i\right)^{-1}\left(\sum_{i=1}^{N}Z_i'\triangle e_{2i}\right),$$
where $\triangle e_{2i} = \triangle y_i - \hat{\gamma}_2\triangle y_{-1i}$ are the individual-level residuals from the two-step estimator. Under $H_0$: $E\left[Z_i'\triangle\epsilon_i\right] = 0$ for all $i = 1, ..., N$, $S \sim_A \chi^2_{L-1}$.
A second specification test suggested by Arellano and Bond (1991) tests for lack of AR(2) correlation in $\triangle e_1$ or $\triangle e_2$, which must hold under Assumption A.1. The AR(2) test statistic under the null has a limiting standard normal distribution.
10.10.3.1. Inference issues. Monte Carlo studies tend to show that estimated standard errors from 2-step GMM estimators are severely downward biased in finite samples (Arellano and Bond 1991). This is not the case for 1-step GMM standard errors, which instead are virtually unbiased. A possible explanation for this finding is that the weighting matrix in 2-step GMM estimators depends on estimated parameters, whereas that in 1-step GMM estimators does not. Windmeijer (2005) proves that in fact a large portion of the finite-sample bias of 2-step GMM standard errors is due to the variation of estimated parameters in the weighting matrix. He derives both a general bias correction and a specific one for panel data models with predetermined regressors as in the Arellano and Bond model.
Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full instrument set has zero power when T, and consequently the number of moment conditions, becomes too large for given N.
10.10.4. Blundell and Bond (1998) System estimator. Blundell and Bond (1998) demonstrate that when $\gamma$ is close to unity, instruments in levels are weakly correlated with $\triangle y_{-1}$, leading to what is known in the econometric literature as a weak-instrument bias. This is easily seen by considering the following example taken from Blundell and Bond. Let $T = 2$; then after taking the model in first differences there is only a cross-section available for estimation:
$$\triangle y_{i,2} = \gamma\triangle y_{i,1} + \triangle\epsilon_{i,2}, \quad i = 1, ..., N,$$
and only one moment condition
$$\frac{1}{N}\sum_{i=1}^{N}\left(\triangle y_{i,2} - \gamma\triangle y_{i,1}\right)y_{i,0} = 0.$$
To what extent is $y_{i,0}$ related to $\triangle y_{i,1}$? To answer this question it suffices to work out the reduced form for $\triangle y_{i,1}$:
$$\triangle y_{i,1} = (\gamma - 1)y_{i,0} + \alpha_i + \epsilon_{i,1},$$
from which it is clear that the closer $\gamma$ is to unity, the weaker the correlation between $y_{i,0}$ and $\triangle y_{i,1}$.
To solve the problem they suggest exploiting the following additional moment restrictions
(10.10.6) $E\left[\left(y_{i,t} - \gamma y_{i,t-1}\right)\triangle y_{i,t-1}\right] = 0, \quad t = 2, ..., T$,
which are valid if, along with Assumption A.1, we maintain that the process for $y_{i,t}$ is mean-stationary, that is,
A.4: $E(y_{i,0}|\alpha_i) = \dfrac{\alpha_i}{1-\gamma}$.
Assumption A.4 is justified if the process started in the distant past. Starting from the model at observation $t = 0$ and going backward in time recursively,
$$y_{i,0} = \gamma y_{i,-1} + \alpha_i + \epsilon_{i,0} = \gamma^2 y_{i,-2} + \gamma\alpha_i + \alpha_i + \gamma\epsilon_{i,-1} + \epsilon_{i,0} = \gamma^3 y_{i,-3} + \gamma^2\alpha_i + \gamma\alpha_i + \alpha_i + \gamma^2\epsilon_{i,-2} + \gamma\epsilon_{i,-1} + \epsilon_{i,0} = \cdots = \frac{\alpha_i}{1-\gamma} + \sum_{t=0}^{\infty}\gamma^t\epsilon_{i,-t} \equiv \frac{\alpha_i}{1-\gamma} + u_{i,0},$$
where $E(u_{i,0}|\alpha_i) = 0$ by Assumption A.1.
That the moment restrictions hold under Assumptions A.1 and A.4 can be seen for $t = 2$:
$$E\left[\left(y_{i,2} - \gamma y_{i,1}\right)\triangle y_{i,1}\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)y_{i,0} + \alpha_i + \epsilon_{i,1}\right]\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)\left(\frac{\alpha_i}{1-\gamma} + u_{i,0}\right) + \alpha_i + \epsilon_{i,1}\right]\right] = E\left[\left(\alpha_i + \epsilon_{i,2}\right)\left[(\gamma-1)u_{i,0} + \epsilon_{i,1}\right]\right] = 0.$$
Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instru-
ments in first differences for the equation in levels.
Hahn (1999) evaluates the efficiency gains brought by exploiting the stationarity of the
initial condition as done by Blundell and Bond, finding that it is substantial also for large T .
Stata’s xtabond performs the Arellano and Bond GMM estimator. Then, there is xtdpdsys,
which implements the GMM system estimator. Third, xtdpd, is a more general command that
allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2
(Roodman 2009) is certainly the most powerful code in Stata to implement dynamic panel
data models.
10.10.5. Application. Arellano and Bond (1991) illustrate their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler versions of their model through differenced and system GMM using xtabond and xtabond2. The do-file abbest.do by D. M. Roodman replicates exactly Arellano and Bond's results using xtabond2.
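As a flavour of what those do-files contain, here is a minimal sketch for a deliberately simplified AR(1) employment equation in abdata.dta (n is log employment; the lag structure and options are illustrative assumptions, not Arellano and Bond's exact specification):

use abdata, clear
xtset id year
xtabond n, lags(1) vce(robust)         // one-step differenced GMM, robust standard errors
estat abond                            // AR(1) and AR(2) tests on the differenced residuals
xtabond n, lags(1) twostep             // two-step estimates (non-robust VCE)
estat sargan                           // Sargan test of the overidentifying restrictions
xtabond2 n L.n, gmm(L.n) robust small  // Blundell-Bond system GMM (user-written command)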
10.10.6. Bias corrected LSDV. IV and GMM estimators in dynamic panel data models are consistent for N large, so they can be severely biased and imprecise in panel data with a small number of cross-sectional units. This certainly applies to most macro panels, but also to micro panels where heterogeneity concerns force the researcher to restrict estimation to small subsamples of individuals.
Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999) demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV and GMM estimators. So, an alternative approach based upon the correction of LSDV for the finite-sample bias has recently become popular in the econometric literature. Kiviet (1995) uses higher-order asymptotic expansion techniques to approximate the small-sample bias of the LSDV estimator, including terms of at most order 1/(TN). Monte Carlo evidence therein shows that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence by Judson and Owen (1999) strongly supports LSDVC when N is small, as in most macro panels. In Kiviet (1999) the bias approximation is made more accurate by including higher-order terms. Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).
Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the user-written Stata command xtlsdvc to implement LSDVC.
Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
10.10.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the
xtabond and xtabond2 applications of Subsection 10.10.5, here we do not use all information
available to estimate the parameters of the labour demand equation in abdata.dta. Instead,
we follow a strategy that, exploiting the industry partition of the cross-sectional dimension
as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily
accomplished by restricting the usable data to the panel of firms belonging to a given industry.
While such a strategy leads to a less restrictive specification for the firm labour demand, it reduces the number of cross-sectional units available for estimation, so that the researcher must be prepared to deal with a potentially severe small-sample bias in any of the industry regressions. Clearly, xtlsdvc is the appropriate solution in this case.
The demonstration is kept as simple as possible by considering regressions for only one industry panel, ind=4.
The following instructions are implemented in a Stata do-file.
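A minimal sketch of what such a do-file might contain follows; the regressor list, the options and the bootstrap repetitions are illustrative assumptions, not necessarily the exact specification of the original listing:

use abdata, clear
xtset id year
* Bias-corrected LSDV (LSDVC) for the dynamic labour demand of industry 4 only;
* xtlsdvc adds the lagged dependent variable automatically, initialises the
* correction with the Arellano-Bond estimator and bootstraps the standard errors.
xtlsdvc n w k if ind==4, initial(ab) bias(3) vcov(50)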
Part 2
Non-linear models
CHAPTER 11
Non-linear regression models
11.1. Introduction
Non-linear models present three main difficulties.
(1) Closed-form solutions for estimators are generally not available
(2) Marginal effects do not coincide with the model coefficients and vary over the sample.
(3) Latent heterogeneity components in cross-sections or panel data require special at-
tention.
The are two do-files demonstrating the methods of this chapter: nlmr.do using the data set
mus10data.dta and nlmr2.do using the data set mus17data.dta. Both data-sets are from
Cameron and Trivedi (2009).
11.2. Non-linear least squares
The regression model specifies the mean of $y$ conditional on a vector of exogenous explanatory variables $\mathbf{x}$ by using some known, non-linear functional form
$$E(y|\mathbf{x}) = \mu(\mathbf{x},\beta).$$
Or, equivalently,
$$y = \mu(\mathbf{x},\beta) + u,$$
where $u = y - E(y|\mathbf{x})$. The non-linear least squares estimator, $b_{NLS}$, minimizes the non-linear residual sum of squares
$$Q = \sum_{i=1}^{n}\left[y_i - \mu(\mathbf{x}_i, b)\right]^2.$$
11.3. Poisson model for count data
Let $y \in \mathbb{N}$ be a count variable: doctor visits, car accidents, etc. The Poisson regression model is a non-linear regression model with
(11.3.1) $E(y|\mathbf{x}) = \exp(\mathbf{x}'\beta)$.
Or, equivalently,
$$y = \exp(\mathbf{x}'\beta) + u,$$
where $u = y - E(y|\mathbf{x})$.
Equation (11.3.1) implies $E\left[y - \exp(\mathbf{x}'\beta)|\mathbf{x}\right] = 0$, and by the Law of Iterated Expectations there are zero covariances between $u$ and $\mathbf{x}$:
(11.3.2) $E_{y,\mathbf{x}}\left(\mathbf{x}\left[y - \exp(\mathbf{x}'\beta)\right]\right) = 0$.
11.3.1. Estimation. There is a random sample $\{y_i, \mathbf{x}_i\}$, $i = 1, ..., n$, for estimation. Given the population moment restrictions (11.3.2), estimation can be carried out with a limited set of assumptions within a GMM set-up: by the analogy principle the consistent GMM estimator $b_{GMM}$ solves the system of $k$ sample analog restrictions
(11.3.3) $\sum_{i=1}^{n}\mathbf{x}_i\left[y_i - \exp(\mathbf{x}_i'\beta)\right] = 0$.
Equations (11.3.3) also coincide with the first-order conditions of the Poisson ML estimator introduced below, so that $b_{ML} = b_{GMM}$ in this case.
Alternatively, we can maintain a Poisson density function for $y$ with mean $\mu$:
$$f(y) = \frac{e^{-\mu}\mu^{y}}{y!}.$$
Importantly, the Poisson model has the equidispersion property: $Var(y) = E(y) = \mu$.
Letting $\mu = \exp(\mathbf{x}'\beta)$ we end up with the conditional log-likelihood function
$$\ln L(y_1 ... y_n|\mathbf{x}_1 ... \mathbf{x}_n,\beta) = \sum_{i=1}^{n}\ln\left\{\frac{\exp\left[-\exp(\mathbf{x}_i'\beta)\right]\exp(\mathbf{x}_i'\beta)^{y_i}}{y_i!}\right\} = \sum_{i=1}^{n}\left[-\exp(\mathbf{x}_i'\beta) + y_i\mathbf{x}_i'\beta - \ln(y_i!)\right]$$
and the ML estimator $b_{ML}$ that maximizes it.
$b_{ML}$ is consistent: $b_{ML} \to_p \beta$. The covariance matrix estimator of $b_{ML}$ is
(11.3.4) $\hat{V}(b_{ML}) = \left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$.
It is easily seen that the $k$ first-order conditions that maximize $\ln L$ coincide with the equations in (11.3.3), so that $b_{ML} = b_{GMM}$. This proves two things: 1) the GMM estimator is asymptotically efficient if the conditional mean function is correctly specified and the density function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the correct density function, as long as the conditional mean is correctly specified. In such cases, when the likelihood function is not correctly specified, we refer to the ML estimator as a pseudo-ML estimator and a robust covariance matrix estimator should be used for inference rather than (11.3.4):
$$\hat{V}_{rob}(b_{ML}) = \left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left[\sum_{i=1}^{n}(y_i - \mu_i)^2\mathbf{x}_i\mathbf{x}_i'\right]\left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}.$$
With equidispersion, $Var(y|\mathbf{x}) = E(y|\mathbf{x}) = \mu$:
• $(y_i - \mu_i)^2$ is close to $\mu_i$, so $\hat{V}(b_{ML})$ is close to $\hat{V}_{rob}(b_{ML})$.
With overdispersion, $Var(y|\mathbf{x}) > E(y|\mathbf{x}) = \mu$:
• $(y_i - \mu_i)^2$ tends to be greater than $\mu_i$, so $\hat{V}(b_{ML})$ is inconsistent, with smaller variance estimates than $\hat{V}_{rob}(b_{ML})$, which remains consistent.
The consistency result for the (pseudo) ML estimator holds in general if two conditions are verified:
(1) The conditional mean is correctly specified.
(2) The density function belongs to an exponential family.
Definition 73. An exponential family of distributions is one whose conditional log-likelihood function at a generic observation is of the form
$$\ln L(y|\mathbf{x},\beta) = a(y) + b\left[\mu(\mathbf{x},\beta)\right] + y\,c\left[\mu(\mathbf{x},\beta)\right].$$
A member of the family is identified by the numerical values of $\beta$.
We verify that the Poisson is an exponential family:
• $a(y) = -\ln(y!)$,
• $b\left[\mu(\mathbf{x},\beta)\right] = -\exp(\mathbf{x}'\beta)$ and
• $y\,c\left[\mu(\mathbf{x},\beta)\right] = y\mathbf{x}'\beta$.
The Normal distribution with a known variance $\sigma^2$,
$$\phi(y|\mathbf{x},\beta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{\left[y - \mu(\mathbf{x},\beta)\right]^2}{2\sigma^2}\right\},$$
is an exponential family also:
• $a(y) = -\ln\left(\sigma\sqrt{2\pi}\right) - y^2/2\sigma^2$,
• $b\left[\mu(\mathbf{x},\beta)\right] = -\mu(\mathbf{x},\beta)^2/2\sigma^2$ and
• $y\,c\left[\mu(\mathbf{x},\beta)\right] = y\,\mu(\mathbf{x},\beta)/\sigma^2$.
The Stata command that implements Poisson regression is poisson, with a syntax close to regress. It computes $b_{ML}$ with standard error estimates obtained from $\hat{V}(b_{ML})$. If the vce(robust) option is given, then Stata recognizes the more robust pseudo-ML set-up and still provides the $b_{ML}$ coefficient estimates, but with the robust covariance matrix $\hat{V}_{rob}(b_{ML})$.
11.4. Modelling and testing overdispersion
We start from a specific Poisson density function, conditional on a random scalar $\nu$,
$$f(y|\nu) = \frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!},$$
with $E(\nu) = 1$ and $Var(\nu) = \sigma^2$. We can find the unconditional moments of $y$ by applying iterated expectations:
$$E(y) = E_\nu\left[E(y|\nu)\right] = \mu$$
and
$$Var(y) = E_\nu\left[Var(y|\nu)\right] + Var\left[E(y|\nu)\right] = E_\nu\left[\mu\nu\right] + Var\left[\mu\nu\right] = \mu + \mu^2\sigma^2 = \mu\left(1 + \mu\sigma^2\right) > \mu,$$
so overdispersion is allowed.
The marginal density function of $y$, $f(y)$, is what is needed for ML estimation, since $\nu$ is not observable. Its generic expression is
$$f(y) = E_\nu\left[\frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!}\right].$$
To find it in closed form we need to specify the marginal density function of $\nu$. If $\nu \sim Gamma(1,\alpha)$, then $f(y)$ is a negative binomial density function, $NB(\mu,\sigma^2)$, with $E(y) = \mu$ and $Var(y) = \mu(1 + \mu\sigma^2)$. Clearly, if $\sigma^2 = 0$, then $\nu$ collapses to its unity mean and $f(y)$ is Poisson.
Specifying $\mu = \exp(\mathbf{x}'\beta)$ yields the NB regression model, and $\beta$ and $\sigma^2$ are estimated via ML based on $NB\left[\exp(\mathbf{x}'\beta),\sigma^2\right]$. Testing for overdispersion within this framework boils down to testing the null hypothesis $\sigma^2 = 0$.
The Stata command that implements the NB regression is nbreg, with a syntax close to regress and poisson. The output also gives the overdispersion (LR) test of $\sigma^2 = 0$.
Overdispersion can also be tested under the null hypothesis of $\sigma^2 = 0$, therefore under Poisson regression, against the alternative of $Var(y|\mathbf{x}) = \mu(1 + \mu\sigma^2)$, therefore NB regression, using a Lagrange Multiplier test. This is based on an auxiliary regression implemented after poisson estimation using an estimate of $\left[Var(y|\mathbf{x})/\mu\right] - 1$, namely $\left[(y_i - \mu_i)^2 - y_i\right]/\mu_i$, as the dependent variable and $\mu_i = \exp(\mathbf{x}_i'b_{ML})$ as the only regressor (no constant). The LM test is the t-statistic computed for the OLS coefficient estimate on $\mu_i$.
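A minimal sketch of this auxiliary regression (continuing the hypothetical docvis example above):

quietly poisson docvis private chronic female income
predict mu, n                                   // fitted means exp(x'b)
generate lhs = ((docvis - mu)^2 - docvis)/mu    // estimate of Var(y|x)/mu - 1
regress lhs mu, noconstant                      // LM test = t-statistic on mu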
CHAPTER 12
Binary dependent variable models
12.1. Introduction
Binary dependent variable models have a dependent variable that partitions the sample into two categories of a given qualitative dimension of interest. For example:
• Labour supply. There are two categories: work/not work (univariate binary model).
• Supplementary private health insurance. There are two categories: purchase/not purchase (univariate binary model).
Binary models are said to be multivariate when there are multiple dimensions that are possibly related:
• Two related dimensions: [Dimension 1: Being overweight (body mass index > 25) $\Rightarrow$ two categories: yes/no] and [Dimension 2: Job satisfaction $\Rightarrow$ two categories: satisfied/dissatisfied] (bivariate binary model).
• Two related dimensions: [Dimension 1: Identity of immigrants with the host country $\Rightarrow$ two categories: yes/no] and [Dimension 2: Identity of immigrants with the country of origin $\Rightarrow$ two categories: yes/no] (bivariate binary model).
In these notes I focus almost exclusively on univariate binary models, except for a digression on the bivariate probit model as estimated by Stata's biprobit.
The do-file bdvm.do is a Stata application on binary models that uses the data set mus14data.dta from Cameron and Trivedi (2009).
12.2. Binary models
Let $A$ be the event of interest (e.g. "an immigrant identifies with the host-country culture"). Let the indicator function $1(A)$ be unity if event $A$ occurs and zero if not. Define the discrete random variable $y$ such that
(12.2.1) $y = 1(A)$.
Then
• $\Pr(y = 1) = \Pr(A) \equiv \rho$ and $\Pr(y = 0) = 1 - \rho$.
• $E(y) = \rho$ and $Var(y) = \rho(1-\rho)$.
We want to assess the impact of $\mathbf{x}$ on the probability of $A$ and to do so we model $\Pr(y = 1|\mathbf{x})$ as a function of $\mathbf{x}$.
Since $0 \leq \Pr(y = 1|\mathbf{x}) \leq 1$, a suitable functional form for $\Pr(y = 1|\mathbf{x})$ is any cumulative distribution function evaluated at a linear combination of $\mathbf{x}$, $F(\mathbf{x}'\beta)$. Accordingly, we specify
(12.2.2) $\Pr(y = 1|\mathbf{x}) = F(\mathbf{x}'\beta)$.
Two popular choices for $F(\cdot)$ are
• Probit model: $F(\cdot) \equiv \Phi(\cdot)$, the Standard Normal distribution;
• Logit model: $F(\cdot) \equiv \Lambda(\cdot) \equiv \exp(\mathbf{x}'\beta)/\left[1 + \exp(\mathbf{x}'\beta)\right]$, the Logistic distribution, with zero mean and variance $\pi^2/3$.
Alternatively, we may model $F(\cdot)$ directly as a linear function of $\mathbf{x}$:
• Linear Probability Model (LPM): $F(\mathbf{x}'\beta) \equiv \mathbf{x}'\beta$.
Since $\Pr(y = 1|\mathbf{x}) = E(y|\mathbf{x})$, Model (12.2.2) can always be expressed as the regression model
(12.2.3) $y = F(\mathbf{x}'\beta) + u$, with $u = y - E(y|\mathbf{x})$.
12.2.1. Latent regression. When $F(\cdot)$ is a distribution function the binary model can be motivated as a latent regression model. In microeconomics this is a convenient way to model individual choices.
Introduce the latent continuous random variable $y^{*}$ with
(12.2.4) $y^{*} = \mathbf{x}'\beta + \varepsilon$,
where $\varepsilon$ is a zero-mean random variable that is independent of $\mathbf{x}$ and with $\varepsilon \sim F$, where $F$ is a distribution function that is symmetric around zero. Then, let $y = 1(y^{*} > 0)$. In our example of immigrant identity we may think of $y^{*}$ as the utility variation faced by a subject with observable and latent characteristics $\mathbf{x}$ and $\varepsilon$, respectively, when he/she decides to conform to the host-country culture, so that event $A$ occurs if and only if $y^{*} > 0$. Then
(12.2.5) $y = 1\left(\varepsilon > -\mathbf{x}'\beta\right)$
$$\Rightarrow \Pr(y = 1|\mathbf{x}) = \Pr\left(\varepsilon > -\mathbf{x}'\beta\,|\,\mathbf{x}\right).$$
Since $\varepsilon$ and $\mathbf{x}$ are independent, $\Pr(\varepsilon \leq \mathbf{x}'\beta|\mathbf{x}) = F(\mathbf{x}'\beta)$. Moreover, by symmetry of $F$, $\Pr(\varepsilon > -\mathbf{x}'\beta|\mathbf{x}) = F(\mathbf{x}'\beta)$ and so
$$\Pr(y = 1|\mathbf{x}) = F(\mathbf{x}'\beta),$$
which is exactly Model (12.2.2).
Inspection of (12.2.5) clarifies that $Var(\varepsilon) = \sigma^2$ and $\beta$ cannot be separately identified, since $\Pr(\varepsilon > -\mathbf{x}'\beta) = \Pr\left[(\varepsilon/\sigma) > -\mathbf{x}'(\beta/\sigma)\right]$. Therefore, to identify $\beta$, $\sigma^2$ must be fixed to some known value. In the probit model $\sigma^2 = 1$ and in the logit model $\sigma^2 = \pi^2/3$.
12.2.2. Estimation. There is a random sample $\{y_i, \mathbf{x}_i\}$, $i = 1, ..., n$, for estimation. In the logit and probit models estimation is carried out via ML. The ML estimator, $b_{ML}$, maximizes the conditional log-likelihood function
$$\ln L(y_1 ... y_n|\mathbf{x}_1 ... \mathbf{x}_n,\beta) = \sum_{i=1}^{n}\left\{y_i\ln\left[F(\mathbf{x}_i'\beta)\right] + (1 - y_i)\ln\left[1 - F(\mathbf{x}_i'\beta)\right]\right\}.$$
We have $b_{ML} \to_p \beta$ and
(12.2.6) $\hat{V}(b_{ML}) = \left\{\sum_{i=1}^{n}\dfrac{f(\mathbf{x}_i'b_{ML})^2\,\mathbf{x}_i\mathbf{x}_i'}{F(\mathbf{x}_i'b_{ML})\left[1 - F(\mathbf{x}_i'b_{ML})\right]}\right\}^{-1}$,
where $f$ is the density function of $F$ (remember that $\partial_x F(x) = f(x)$).
The Stata commands that compute $b_{ML}$ and $\hat{V}(b_{ML})$ in the probit and logit models are, respectively, probit and logit. The syntax is similar to regress.
The LPM assumes $F = X\beta$. So, Equation (12.2.3) is a linear regression model that can be estimated by regress. In this case the model coefficients are identical to the marginal effects of interest. But $Var(u|\mathbf{x}) = Var(y|\mathbf{x}) = \mathbf{x}'\beta(1 - \mathbf{x}'\beta)$, so the model is heteroskedastic and regress should be supplemented by the vce(robust) option.
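For instance, a minimal sketch in the spirit of bdvm.do; ins (an indicator for purchasing supplementary insurance) and the regressors are assumed to be variables of mus14data.dta and should be treated as placeholders:

use mus14data, clear
probit ins age hhincome educyear married
logit ins age hhincome educyear married
regress ins age hhincome educyear married, vce(robust)   // LPM with robust standard errors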
12.2.3. Heteroskedasticity. Unlike the non-linear models examined in Chapter 11, in probit and logit models heteroskedasticity brings about misspecification of the conditional mean, so that the ML estimators of both models become inconsistent. Hence, it makes little sense to complement probit and logit coefficient estimates with heteroskedasticity-robust standard error estimates.
Heteroskedasticity can be modeled, though. In the probit model, instead of fixing $\sigma^2 = 1$, one can allow heteroskedasticity by setting $\sigma_i = \exp(\mathbf{z}_i'\gamma)$, so that
(12.2.7) $\Pr(y_i = 1|\mathbf{x}) = \Phi\left(\mathbf{x}_i'\beta/\exp(\mathbf{z}_i'\gamma)\right)$.
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides an LR test for the null of homoskedasticity ($\gamma = 0$).
12.2.4. Clustering. Differently from heteroskedasticity, it makes sense to adjust standard error estimates for within-cluster correlation. This is the case since within-cluster correlation leaves the conditional expectation of an individual observation unaffected, so that the ML estimator can be motivated as a partial ML estimator, which remains consistent even if observations are not independent (see Wooldridge 2010, p. 609). The Stata option vce(cluster clustervar), therefore, can be conveniently included in both probit and logit statements.
12.3. Coefficient estimates and marginal effects
There is no exact relationship between the coefficient estimates from the three foregoing models. Amemiya (1981) works out the following rough conversion factors:
$$b_{logit} \simeq 4\,b_{ols}, \qquad b_{probit} \simeq 2.5\,b_{ols}, \qquad b_{logit} \simeq 1.6\,b_{probit}.$$
The marginal effects of $\mathbf{x}$ at observation $i$ are estimated by probit and logit as
$$(\partial_{\mathbf{x}}F_i)_{probit} = f_{probit,i}\,b_{probit} = \phi\left(\mathbf{x}_i'b_{probit}\right)b_{probit},$$
$$(\partial_{\mathbf{x}}F_i)_{logit} = f_{logit,i}\,b_{logit} = \Lambda\left(\mathbf{x}_i'b_{logit}\right)\left[1 - \Lambda\left(\mathbf{x}_i'b_{logit}\right)\right]b_{logit},$$
and by the LPM as
$$(\partial_{\mathbf{x}}F_i)_{ols} = b_{ols}.$$
The following relationships hold:
$$(\partial_{\mathbf{x}}F)_{logit} \leq 0.25\,b_{logit} \simeq b_{ols}, \qquad (\partial_{\mathbf{x}}F)_{probit} \leq 0.4\,b_{probit} \simeq b_{ols}.$$
The post-estimation command margins with the option dydx(varlist) estimates marginal effects for each of the variables in varlist. Marginal effects can be estimated at a point $\mathbf{x}$ (conventionally, the sample mean when variables are continuous, in which case the option atmeans must be supplied) or can be averaged over the sample (the default).
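For instance, continuing the hypothetical probit specification above:

quietly probit ins age hhincome educyear married
margins, dydx(*)            // marginal effects averaged over the sample (the default)
margins, dydx(*) atmeans    // marginal effects at the sample means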
12.4. Tests and Goodness of fit measures
Parameter restrictions can be tested by Wald tests (test) and LR tests (lrtest). As explained above, hetprobit, besides producing coefficient estimates, provides a heteroskedasticity test.
The most common goodness-of-fit measures reported in logit or probit outputs are:
• The overall percent correctly predicted (opcp). Define the predictor $\hat{y}_i$ of $y_i$ as
$$\hat{y}_i = \begin{cases} 1 & \text{if } F\left(\mathbf{x}_i'\hat{\beta}\right) \geq 0.5 \\ 0 & \text{else.} \end{cases}$$
The opcp is given by the number of times $\hat{y}_i = y_i$ over $n$. A problem with this measure is that it can be high also in cases where the model poorly predicts one outcome. It may be more informative in these cases to compute the percent correctly predicted for each outcome separately: 1) the number of times $\hat{y}_i = y_i = 1$ over the number of times $y_i = 1$, and 2) the number of times $\hat{y}_i = y_i = 0$ over the number of times $y_i = 0$. This is done through the post-estimation command estat classification.
• Test the discrepancy between the actual frequency of an outcome and the estimated average probability of the same outcome within a subsample $S$ of interest (for example, females in a sample of workers):
$$\bar{y}_S \equiv \frac{1}{n_S}\sum_{i \in S} y_i \quad \text{vs.} \quad \bar{p}_S \equiv \frac{1}{n_S}\sum_{i \in S} F\left(\mathbf{x}_i'\hat{\beta}\right).$$
Doing this on the whole sample makes little sense because the two measures are always very close (they are equal in the logit model with an intercept).
• Evaluate the pseudo R-squared: $\tilde{R}^2 = 1 - L(\hat{\beta})/L(\bar{y})$, where $L(\hat{\beta})$ is the value of the maximized log-likelihood and $L(\bar{y})$ is the log-likelihood evaluated for the model with only the intercept. Always $0 < \tilde{R}^2 < 1$.
12.5. Endogenous regressors
In the presence of a continuous endogenous regressor in the latent regression model one can use an instrumental variable probit estimator. This is implemented by ivprobit, with a syntax similar to ivregress. When the endogenous regressor is binary, we can apply a bivariate recursive probit model, as explained in Subsection 12.7.1.
12.6. Independent latent heterogeneity
In the latent regression model (12.2.4) all explanatory variables are observed. But it may be the case that relevant explanatory variables are latent, as allowed by the following specification of the model
$$y^{*} = \mathbf{x}'\beta + \mathbf{w}'\beta_w + \varepsilon,$$
where the $\mathbf{w}$'s are latent variables. There is thus a latent heterogeneity component $\alpha \equiv \mathbf{w}'\beta_w$ to consider in the model along with $\varepsilon$. We make the following assumption:
• $\alpha|\mathbf{x},\varepsilon \sim N(0,\sigma^2)$.
Then $\alpha + \varepsilon|\mathbf{x} \sim N(0, 1 + \sigma^2)$ and
$$\frac{y^{*}}{\sqrt{1+\sigma^2}} = \mathbf{x}'\frac{\beta}{\sqrt{1+\sigma^2}} + \frac{\alpha + \varepsilon}{\sqrt{1+\sigma^2}}$$
is a legitimate probit model. In fact, $y^{*}/\sqrt{1+\sigma^2}$ is latent,
$$\frac{\alpha + \varepsilon}{\sqrt{1+\sigma^2}}\Big|\,\mathbf{x} \sim N(0,1)$$
and so
$$\Phi\left(\mathbf{x}'\frac{\beta}{\sqrt{1+\sigma^2}}\right) = \Pr(y = 1|\mathbf{x}).$$
It follows that we can apply standard probit ML estimation and the resulting estimator\�/p1 + �2 is consistent for �/
p1 + �2 and so is �
✓
x
0 \�p1+�2
◆
for the response probabilities
Pr (y = 1|x) .
From the above analysis it clearly emerges that \�/p1 + �2 estimates � with a downward
bias (Yatchew and Griliches (1985)). Nonetheless, if our interest centers on marginal effects
@xPr (y|↵,x) averaged over ↵ (AME’s), E↵ [@xPr (y|↵,x)], this is no problem.
Indeed, given f (↵|x) the conditional density function of ↵, it is generally true that
Pr (y|x) =ˆ
↵|x
Pr (y|x,↵) f (↵|x) d↵
But since ↵ and x are independent f (↵|x) = f (↵) and so
Pr (y|x) =ˆ↵
Pr (y|x,↵) f (↵) d↵ = E↵ [Pr (y|↵,x)]
Hence, under mild regularity conditions that permit interchanging integrals and derivatives,
@xPr (y|x) = E↵ [@xPr (y|↵,x)]
The above result is important, for it establishes that to estimate $\Pr(y|\mathbf{x})$ and $\partial_{\mathbf{x}}\Pr(y|\mathbf{x})$ is to estimate $E_\alpha\left[\Pr(y|\alpha,\mathbf{x})\right]$ and $E_\alpha\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$, respectively. So $\Phi\left(\mathbf{x}'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ is a consistent estimator of $E_\alpha\left[\Pr(y|\alpha,\mathbf{x})\right]$; likewise, $\partial_{\mathbf{x}}\Phi\left(\mathbf{x}'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ is a consistent estimator of $E_\alpha\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$ (see Wooldridge (2005) and Wooldridge (2010)).
If evaluated at a given point $\mathbf{x}_0$, the AMEs are averages over $\alpha$ alone. To estimate $E_{\mathbf{x},\alpha}\left[\Pr(y|\alpha,\mathbf{x})\right]$ and $E_{\mathbf{x},\alpha}\left[\partial_{\mathbf{x}}\Pr(y|\alpha,\mathbf{x})\right]$, just average $\Phi\left(\mathbf{x}_i'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ and $\partial_{\mathbf{x}}\Phi\left(\mathbf{x}_i'\,\widehat{\beta/\sqrt{1+\sigma^2}}\right)$ over the sample.
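A small simulation can illustrate both points: the probit coefficient is attenuated when the latent heterogeneity is ignored, while the average marginal effect is still estimated consistently. This is only a sketch under assumed parameter values (none of the numbers come from these notes): with $\beta = 1$ and $\sigma^2 = 3$, the probit of $y$ on $x$ should recover $1/\sqrt{1+3} = 0.5$.

* DGP: y* = 1 + x + alpha + e, with alpha ~ N(0,3) independent of x and e, e ~ N(0,1)
clear
set seed 12345
set obs 50000
generate x     = rnormal()
generate alpha = sqrt(3)*rnormal()
generate e     = rnormal()
generate y     = (1 + x + alpha + e > 0)
probit y x          // slope estimate close to 0.5, not to the structural value 1
margins, dydx(x)    // consistent for the average marginal effect E[dPr(y=1|alpha,x)/dx]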
12.7. Multivariate probit models
Multivariate probit models can be conveniently represented using the latent-regression
framework. There are $m$ binary variables, $y_1, y_2, \ldots, y_m$, that may be related.
Multivariate probit models are constructed by supplementing the random vector $\mathbf{y}$ defined in (12.2.1) with the latent regression model
$$(12.7.1)\qquad y_j^* = \mathbf{x}'\beta_j + \varepsilon_j,$$
$j = 1, \ldots, m$, where $\beta_j$, $\mathbf{x}$ and $\varepsilon_j$ are, respectively, the $p\times 1$ vectors of parameters and explanatory variables and the error term. Stacking all $\varepsilon_j$'s into the vector $\varepsilon \equiv (\varepsilon_1, \ldots, \varepsilon_m)'$, we assume $\varepsilon|\mathbf{x} \sim N(0,R)$. The covariance matrix $R$ is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing $\beta_j$ to have zeroes in the positions of the variables in $\mathbf{x}$ that are excluded from equation $j$. Cross-equation restrictions on the $\beta$'s are also permitted. $R$ is normalized for scale and so has unit diagonal elements and arbitrary off-diagonal elements, $\rho_{ij}$, which allows for possible cross-equation correlation of the errors. It may or may not be subject to constraints beyond normalization. If $m = 2$ we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to that of probit.
12.7.1. Recursive models. An interesting class of multivariate probit models is that of the recursive models. In recursive probit models the variables in $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ are allowed as right-hand-side variables of the latent system, provided that the $m\times m$ matrix of coefficients on $\mathbf{y}$ is restricted to be triangular (Roodman 2011). This means that if the model is bivariate, the latent system is
$$y_1^* = \mathbf{x}'\beta_1 + \gamma y_2 + \varepsilon_1$$
$$(12.7.2)\qquad y_2^* = \mathbf{x}'\beta_2 + \varepsilon_2$$
It is then evident that estimating a bivariate recursive probit model subsumes the estimation of a univariate probit model with a binary endogenous regressor, which is the first equation of system (12.7.2).
The feature that makes the recursive multivariate probit model appealing is that it accom-
modates endogenous, binary explanatory variables without special provisions for endogeneity,
simply maximizing the log-likelihood function as if the explanatory variables were all ordinary
exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof, Roodman 2011). This can be easily seen here in the case of the recursive bivariate model:
$$\Pr(y_1=1, y_2=1|\mathbf{x}) = \Pr(y_1=1|y_2=1,\mathbf{x})\Pr(y_2=1|\mathbf{x})$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma\,|\,y_2=1,\mathbf{x}\right]\Pr\left[y_2=1|\mathbf{x}\right]$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma\,|\,\varepsilon_2 > -\mathbf{x}'\beta_2,\mathbf{x}\right]\Pr\left[\varepsilon_2 > -\mathbf{x}'\beta_2|\mathbf{x}\right]$$
$$= \Pr\left[\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma,\ \varepsilon_2 > -\mathbf{x}'\beta_2\,|\,\mathbf{x}\right]$$
$$= \Phi_2\left(\mathbf{x}'\beta_1 + \gamma,\ \mathbf{x}'\beta_2\right).$$
The crux of the above derivations is that, given
$$y_1 = \mathbf{1}\left(\varepsilon_1 > -\mathbf{x}'\beta_1 - \gamma y_2\right) \quad\text{and}\quad y_2 = \mathbf{1}\left(\varepsilon_2 > -\mathbf{x}'\beta_2\right),$$
$\varepsilon_1$ is independent of the lower limit of integration conditional on $\varepsilon_2 > -\mathbf{x}'\beta_2$, and so no endogeneity issue emerges when working out the joint probability as a joint normal distribution. The other three joint probabilities are derived similarly, so that eventually the likelihood function is assembled exactly as in a conventional multivariate probit model.¹
Starting with the contributions of Evans and Schwab (1995) and Greene (1998), there
are by now many econometric applications of this model, including the recent articles by
Fichera and Sutton (2011) and Entorf (2012). The user-written command mvprobit deals with $m > 2$; it evaluates the multiple integrals by simulation (see Cappellari and Jenkins (2003)).
¹ Wooldridge (2010) argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (12.7.2) substantial identification requires some zeroes in $\beta_1$, where the corresponding variables may then be thought of as instruments for $y_2$.
The recent user-written command cmp (see Roodman (2011)) is a more general simulation-based procedure that can estimate many multiple-response and multivariate models.
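A minimal sketch of the recursive bivariate probit in Stata, with hypothetical variable names (y1 outcome, y2 binary endogenous regressor, x1 x2 controls, z1 an excluded instrument in the spirit of footnote 1); the cmp lines reproduce my reading of Roodman's syntax and should be checked against its help file:

* hypothetical variables: y1 outcome, y2 binary endogenous, x1 x2 controls, z1 excluded instrument
biprobit (y1 = y2 x1 x2) (y2 = x1 x2 z1)
* the test of rho = 0 reported at the foot of the biprobit output gauges the endogeneity of y2
* the same model with the user-written cmp (ssc install cmp):
* cmp setup
* cmp (y1 = y2 x1 x2) (y2 = x1 x2 z1), indicators($cmp_probit $cmp_probit)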
CHAPTER 13
Censored and selection models
13.1. Introduction
• Censored models (Tobit models): y has lower and/or upper limits
• Selection models: some values of y are missing not at random.
13.2. Tobit models
Consider the latent regression model
$$y^* = \mathbf{x}'\beta + \varepsilon,$$
with $\varepsilon|\mathbf{x} \sim N\left(0,\sigma^2\right)$. $y$ is an observed random variable such that
$$y = \begin{cases} y^* & \text{if } y^* > L \\ L & \text{if } y^* \leq L \end{cases}$$
where $L$ is a known lower limit.
Think of a utility-maximizing individual with latent and observable characteristics $\varepsilon$ and $\mathbf{x}$, respectively, choosing $y$ subject to the inequality constraint $y \geq L$, with $y^*$ denoting the unconstrained solution. For some individuals the constraint is binding ($y = L$) and for the others it is not ($y > L$). Focussing on the latter subpopulation, the regression model is
$$y = E(y|\mathbf{x}, y > L) + u$$
$$= E\left(\mathbf{x}'\beta + \varepsilon\,|\,\mathbf{x}, \varepsilon > L - \mathbf{x}'\beta\right) + u$$
$$(13.2.1)\qquad = \mathbf{x}'\beta + E\left(\varepsilon\,|\,\mathbf{x}, \varepsilon > L - \mathbf{x}'\beta\right) + u$$
where $u = y - E(y|\mathbf{x}, y > L)$. The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874-876):
$$f(z|z>\alpha) = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right)\Big/\left\{1-\Phi\left[(\alpha-\mu)/\sigma\right]\right\}$$
$$f(z|z<\alpha) = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right)\Big/\Phi\left[(\alpha-\mu)/\sigma\right]$$
$$E(z|z>\alpha) = \mu + \sigma\,\frac{\phi\left[(\alpha-\mu)/\sigma\right]}{1-\Phi\left[(\alpha-\mu)/\sigma\right]}$$
$$E(z|z<\alpha) = \mu - \sigma\,\frac{\phi\left[(\alpha-\mu)/\sigma\right]}{\Phi\left[(\alpha-\mu)/\sigma\right]}.$$
The foregoing equalities are all based on the following representations of a general cumulative distribution function, $F_{(\mu,\sigma^2)}$:
$$F_{(\mu,\sigma^2)}(\alpha) = \Pr\left[(z-\mu)/\sigma \leq (\alpha-\mu)/\sigma\right] = F_{(0,1)}\left[(\alpha-\mu)/\sigma\right]$$
and of a general normal density, $\phi_{(\mu,\sigma^2)}(z)$:
$$\phi_{(\mu,\sigma^2)}(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(z-\mu)^2}{2\sigma^2}\right] = \frac{1}{\sigma}\phi\left(\frac{z-\mu}{\sigma}\right).$$
Then, Model (13.2.1) can be written in closed form as
$$y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left[(L-\mathbf{x}'\beta)/\sigma\right]}{1-\Phi\left[(L-\mathbf{x}'\beta)/\sigma\right]} + u.$$
By symmetry of the normal distribution,
$$(13.2.2)\qquad y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left[(\mathbf{x}'\beta-L)/\sigma\right]}{\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]} + u.$$
If $L = 0$, the foregoing reduces to
$$y = \mathbf{x}'\beta + \sigma\,\frac{\phi\left(\mathbf{x}'\beta/\sigma\right)}{\Phi\left(\mathbf{x}'\beta/\sigma\right)} + u.$$
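Before turning to estimation, the truncated-mean formula above can be checked numerically. A minimal simulation sketch (the values $\mu = 2$, $\sigma = 3$, $\alpha = 1$ are purely illustrative):

* draw z ~ N(mu, sigma^2) and compare the simulated mean of z | z > a with
* mu + sigma*phi(c)/(1 - Phi(c)), where c = (a - mu)/sigma
clear
set seed 2015
set obs 1000000
scalar mu    = 2
scalar sigma = 3
scalar a     = 1
generate z = scalar(mu) + scalar(sigma)*rnormal()
summarize z if z > scalar(a)
scalar c = (scalar(a) - scalar(mu))/scalar(sigma)
display "analytical truncated mean = " scalar(mu) + scalar(sigma)*normalden(scalar(c))/(1 - normal(scalar(c)))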
13.2.1. Estimation. There is a random sample $\{y_i,\mathbf{x}_i\}$, $i = 1, \ldots, n$, for estimation. Let $d_i = \mathbf{1}(y_i > L)$. Estimation can be via ML or two-step LS.
The log-likelihood function assembles the density functions for the subsample of individuals with $d_i = 1$ and those for the individuals with $d_i = 0$ (left-censored). For an individual with $d_i = 1$, $y_i = y_i^*$ and we know that $y_i^*|\mathbf{x}_i \sim N\left(\mathbf{x}_i'\beta,\sigma^2\right)$. Therefore, we can evaluate the density at the single point $y_i - \mathbf{x}_i'\beta$:
$$f(y_i|\mathbf{x}_i) = \frac{1}{\sigma}\phi\left(\frac{y_i-\mathbf{x}_i'\beta}{\sigma}\right).$$
For a left-censored individual ($d_i = 0$), all we know is that $\varepsilon_i \leq L - \mathbf{x}_i'\beta$, so we integrate the density over the interval $\varepsilon_i \leq L - \mathbf{x}_i'\beta$ to get $\Pr(y_i = L|\mathbf{x}_i) = \Phi\left[(L-\mathbf{x}_i'\beta)/\sigma\right]$. Therefore, the log-likelihood function is
$$\ln L\left(y_1 \ldots y_n|\mathbf{x}_1 \ldots \mathbf{x}_n,\beta\right) = \sum_{i=1}^{n}\left\{d_i\ln\left[\frac{1}{\sigma}\phi\left(\frac{y_i-\mathbf{x}_i'\beta}{\sigma}\right)\right] + (1-d_i)\ln\Phi\left[\frac{L-\mathbf{x}_i'\beta}{\sigma}\right]\right\}.$$
The ML estimator $b_{ML}$ is consistent for $\beta$, asymptotically normal and asymptotically efficient.
Two-step LS, $b_{2\text{-}step}$, is based on Equation (13.2.2). In the first step we run a probit regression using $d_i$ as the dependent variable to estimate $\phi\left[(\mathbf{x}'\beta-L)/\sigma\right]/\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]$ by $\widehat{\phi_i/\Phi_i} \equiv \phi\left(\mathbf{x}_i'b_{probit}\right)/\Phi\left(\mathbf{x}_i'b_{probit}\right)$ (recall that $b_{probit}$ is indeed a consistent estimate of $\beta/\sigma$, and $L/\sigma$ is subsumed in the constant term). In the second step we apply an OLS regression of $y_i$ on $\mathbf{x}_i$ and $\widehat{\phi_i/\Phi_i}$, restricted to the uncensored subsample $d_i = 1$. $b_{2\text{-}step}$ is consistent, but standard errors need to be adjusted since the second step includes an estimated regressor.
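A minimal Stata sketch of the two-step procedure with $L = 0$ and hypothetical variables y, x1, x2 (as noted above, the standard errors reported by the second-step regress ignore the estimated regressor and are therefore not valid as they stand):

* two-step LS for the tobit model with L = 0 (hypothetical variables y, x1, x2)
generate byte d = (y > 0)
probit d x1 x2                                   // step 1: estimates beta/sigma
predict xbhat, xb
generate mills = normalden(xbhat)/normal(xbhat)  // phi/Phi evaluated at x'b_probit
regress y x1 x2 mills if d == 1                  // step 2: OLS on the uncensored subsample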
Upper limits can be dealt with similarly:
$$y = \begin{cases} y^* & \text{if } y^* < U \\ U & \text{if } y^* \geq U. \end{cases}$$
Also, lower and upper limits jointly:
$$y = \begin{cases} L & \text{if } y^* \leq L \\ y^* & \text{if } L < y^* < U \\ U & \text{if } y^* \geq U. \end{cases}$$
The Stata command that computes $b_{ML}$ in the tobit model is tobit. The syntax is similar to that of regress, requiring in addition options specifying lower limits, ll(#), and upper limits, ul(#).
Marginal effects of interest are
• $\partial_{\mathbf{x}} E(y^*|\mathbf{x}) = \beta$
• $\partial_{\mathbf{x}} E(y|\mathbf{x}, y > L) = \beta\left[1 - w\lambda(w) - \lambda^2(w)\right]$, where $w = (\mathbf{x}'\beta - L)/\sigma$ and $\lambda(w) = \phi\left[(\mathbf{x}'\beta-L)/\sigma\right]/\Phi\left[(\mathbf{x}'\beta-L)/\sigma\right]$.
• $\partial_{\mathbf{x}} E(y|\mathbf{x}) = \Phi(w)\beta$, the effect on the censored mean.
These are computed by margins and the older mfx.
13.2.2. Heteroskedasticity and clustering. The same considerations made for binary models in Sections 12.2.3 and 12.2.4 hold here. While heteroskedasticity breaks down the specification of the conditional expectations, clustering does not. Therefore, it makes sense to apply the Stata option vce(cluster clustervar).
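A minimal Stata sketch pulling these pieces together, with hypothetical variables y (censored below at 0), x1, x2 and a cluster identifier id; the predict() options passed to margins follow my reading of the tobit post-estimation documentation and should be checked against the help file:

* hypothetical variables: y censored below at 0, x1 x2 regressors, id cluster identifier
tobit y x1 x2, ll(0) vce(cluster id)
margins, dydx(*)                         // effects on the latent mean E(y*|x), i.e. beta
margins, dydx(*) predict(ystar(0,.))     // effects on the censored mean E(y|x)
margins, dydx(*) predict(e(0,.))         // effects on the truncated mean E(y|x, y > 0)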
13.3. A simple selection model
Two processes: the first selects the units into the sample, the second generates $y$. The two processes are related: selection is endogenous and cannot be ignored!
Selection process:
$$s^* = \mathbf{z}'\gamma + \eta,$$
$$d = \mathbf{1}\left(s^* > 0\right).$$
Process for $y$:
$$y^* = \mathbf{x}'\beta + \varepsilon,$$
$$y = \begin{cases} y^* & \text{if } d = 1 \\ \text{missing} & \text{if } d = 0 \end{cases}$$
Interest is in $\beta$. The two processes are related, for $\varepsilon$ and $\eta$ are jointly distributed as
$$\begin{pmatrix} \eta \\ \varepsilon \end{pmatrix}\Big|\,\mathbf{z},\mathbf{x} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & \sigma_{\eta\varepsilon} \\ \sigma_{\eta\varepsilon} & \sigma_\varepsilon^2 \end{pmatrix}\right]$$
Estimation is via ML. The log-likelihood is
$$\ln L = \sum_{i=1}^{n}\left\{d_i\ln\left[f(y_i|d_i=1)\Pr(d_i=1)\right] + (1-d_i)\ln\left[\Pr(d_i=0)\right]\right\}.$$
The Stata command that computes $b_{ML}$ in the selection model is heckman. The syntax is similar to that of regress, requiring in addition an option specifying the list of variables in the selection process, $d$ and $\mathbf{z}$: select(varlist_s).
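A minimal sketch of the heckman syntax with hypothetical variables (y is observed only when d = 1; z1 and z2 appear in the selection equation only):

* hypothetical variables: y outcome (missing when d = 0), x1 x2 regressors,
* d selection indicator, z1 z2 selection-equation variables
heckman y x1 x2, select(d = x1 z1 z2)           // full-information ML
heckman y x1 x2, select(d = x1 z1 z2) twostep   // Heckman's two-step alternative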
CHAPTER 14
Quantile regression
14.1. Introduction
Define the conditional c.d.f. of $Y$: $F(y|\mathbf{x}) = \Pr(Y \leq y|\mathbf{x})$. Instead of $E(y|\mathbf{x})$, as in the CRM, we model quantiles of $F(y|\mathbf{x})$.
The conditional median function of $y$, $Q_{0.5}(y|\mathbf{x})$, is an example. Specifically, for given $\mathbf{x}$ and $F(y|\mathbf{x})$, $Q_{0.5}(y|\mathbf{x})$ is the function that assigns the median of $F(y|\mathbf{x})$ to $\mathbf{x}$, and is implicitly defined by
$$F\left[Q_{0.5}(y|\mathbf{x})\,|\,\mathbf{x}\right] = 0.5$$
or, explicitly,
$$(14.1.1)\qquad Q_{0.5}(y|\mathbf{x}) = F^{-1}(0.5|\mathbf{x}).$$
More generally, given the quantile $q \in (0,1)$, define the conditional quantile function, $Q_q(y|\mathbf{x})$, as
$$(14.1.2)\qquad Q_q(y|\mathbf{x}) = F^{-1}(q|\mathbf{x}).$$
14.2. Properties of conditional quantiles
Define a predictor function $\hat{y}(\mathbf{x})$ and let $\varepsilon(y,\mathbf{x})$ be the corresponding prediction error, $\varepsilon(y,\mathbf{x}) = y - \hat{y}(\mathbf{x})$; then
• $L_q = q E_{\varepsilon(y,\mathbf{x})\geq 0}\left[\varepsilon(y,\mathbf{x})\right] + (q-1) E_{\varepsilon(y,\mathbf{x})<0}\left[\varepsilon(y,\mathbf{x})\right]$ is minimized when $\hat{y}(\mathbf{x}) = Q_q(y|\mathbf{x})$.
• $Q_q(y|\mathbf{x})$ is equivariant to monotone transformations: let $h(\cdot)$ be a monotonic function, then $Q_q\left[h(y)|\mathbf{x}\right] = h\left[Q_q(y|\mathbf{x})\right]$.
In the case of the median:
$$L_{0.5} = E_{y,\mathbf{x}}\left(|\varepsilon(y,\mathbf{x})|\right) \Longrightarrow Q_{0.5}(y|\mathbf{x}) \text{ minimizes } E_{y,\mathbf{x}}\left(|\varepsilon(y,\mathbf{x})|\right).$$
In other words, $Q_{0.5}(y|\mathbf{x})$ is the minimum mean absolute error predictor.
14.3. Estimation
There is a sample $\{y_i,\mathbf{x}_i\}$, $i = 1, \ldots, n$, for estimation.
14.3.1. Marginal effects. We can put the QR model in close relationship with the CRM. Let
$$y_i = E(y_i|\mathbf{x}_i) + u_i$$
where $u_i = y_i - E(y_i|\mathbf{x}_i)$; then
$$Q_q(y_i|\mathbf{x}_i) = E(y_i|\mathbf{x}_i) + Q_q(u_i|\mathbf{x}_i).$$
This is proved by invoking the equivariance property of $Q_q(\cdot|\mathbf{x})$, since conditional on $\mathbf{x}_i$ the shift by the constant $E(y_i|\mathbf{x}_i)$ is a monotone transformation:
$$Q_q(y_i|\mathbf{x}_i) = Q_q\left[E(y_i|\mathbf{x}_i) + u_i|\mathbf{x}_i\right] = E(y_i|\mathbf{x}_i) + Q_q(u_i|\mathbf{x}_i).$$
14.3.1.1. i.i.d. case. If $\{u_i\}$, $i = 1, \ldots, n$, are i.i.d. and independent of $\mathbf{x}_i$, then $Q_q(u_i|\mathbf{x}_i)$ is constant over the sample and varies only with $q$: $Q_q(u_i|\mathbf{x}_i) = \delta_q$, so that
$$Q_q(y_i|\mathbf{x}_i) = E(y_i|\mathbf{x}_i) + \delta_q,$$
which implies that here the marginal effects computed from the regression model coincide with those computed from the QR. Therefore
$$\partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \partial_{\mathbf{x}} E(y_i|\mathbf{x}_i)$$
and
$$E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta.$$
14.3.1.2. General case. If the i.i.d. assumption does not hold (e.g. because of heteroskedasticity), then
$$\partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \partial_{\mathbf{x}} E(y_i|\mathbf{x}_i) + \partial_{\mathbf{x}} Q_q(u_i|\mathbf{x}_i)$$
and
$$E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta + \partial_{\mathbf{x}} Q_q(u_i|\mathbf{x}_i).$$
Also,
$$(14.3.1)\qquad E(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta \ \text{ and } \ Q_q(u_i|\mathbf{x}_i) = \mathbf{x}_i'\gamma_q \Longrightarrow \partial_{\mathbf{x}} Q_q(y_i|\mathbf{x}_i) = \beta + \gamma_q.$$
14.3.2. The linear QR. The linear quantile regression model specifies $Q_q(y_i|\mathbf{x}_i)$ as a linear function
$$Q_q(y_i|\mathbf{x}_i) = \mathbf{x}_i'\beta_q$$
or, equivalently,
$$y_i = \mathbf{x}_i'\beta_q + u_{q,i}$$
where $u_{q,i} = y_i - \mathbf{x}_i'\beta_q$ and, from (14.3.1), $\beta_q = \beta + \gamma_q$. A consistent estimator, $b_q$, for $\beta_q$ is found by the analogy principle:
$$b_q = \arg\min_b \left\{ q\sum_{y_i \geq \mathbf{x}_i'b}\left(y_i - \mathbf{x}_i'b\right) + (q-1)\sum_{y_i < \mathbf{x}_i'b}\left(y_i - \mathbf{x}_i'b\right)\right\}.$$
Under mild regularity conditions
$$b_q \overset{a}{\sim} N\left(\beta_q,\ A^{-1}BA^{-1}\right)$$
where $A = \sum_i f_{u_q}(0|\mathbf{x}_i)\,\mathbf{x}_i\mathbf{x}_i'$, $B = q(1-q)\sum_i \mathbf{x}_i\mathbf{x}_i'$ and $f_{u_q}(0|\mathbf{x}_i)$ is the conditional density of $u_{q,i}$ evaluated at $u_{q,i} = 0$. The presence of this density makes $A^{-1}BA^{-1}$ difficult to estimate; it is better to apply conventional bootstrap procedures.
The main Stata command implementing QR is qreg. Its syntax is similar to that of regress. The quantile(.##) option of qreg indicates the quantile of choice (e.g. to get the median, which is also the default, set .##=.50). To produce QR estimates with bootstrap standard errors apply bsqreg. The reps(#) option of bsqreg indicates the number of bootstrap replications.
Implementing the same model at various quantiles through repeated qreg regressions can shed light on discrepancies in behavior across different regions of the distribution of the variable of interest. To evaluate the statistical significance of such discrepancies, though, it is necessary to estimate a larger covariance matrix encompassing the covariances between coefficient estimators across quantiles. This is done by sqreg, which provides simultaneous estimates of the quantile regressions chosen by the researcher.
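A minimal sketch of these commands with hypothetical variables y, x1, x2 (the cross-quantile test assumes sqreg labels the equations q25 and q75, as I recall from its help file; option names should be double-checked there):

* hypothetical variables: y, x1, x2
qreg y x1 x2                                    // median regression (default quantile .50)
qreg y x1 x2, quantile(.25)                     // first-quartile regression
bsqreg y x1 x2, quantile(.25) reps(400)         // same, with bootstrap standard errors
sqreg y x1 x2, quantiles(.25 .5 .75) reps(400)  // simultaneous estimation across quantiles
test [q25]x1 = [q75]x1                          // equality of the x1 coefficient across quantiles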
14.4. A heteroskedastic regression model with simulated data
This is based on ch. 7.4 in Cameron and Trivedi (2009) and clarifies a couple of somewhat
intricate points therein.
Generate the data from a heteroskedastic linear regression model:
$$y = 1 + x_2 + x_3 + u,$$
$$u = (0.1 + 0.5x_2)\,\varepsilon,$$
$$x_2 \sim \chi^2(1),$$
$$x_3|x_2 \sim N(0, 25),$$
$$\varepsilon|x_2, x_3 \sim N(0, 25).$$
It is not hard to verify that $E(y|\mathbf{x}) = 1 + x_2 + x_3$. Therefore,
$$\partial_{x_2} E(y|\mathbf{x}) = 1$$
$$\partial_{x_3} E(y|\mathbf{x}) = 1$$
and we expect that an OLS regression will yield coefficient estimates close to the foregoing marginal effects. Also,
$$Q_q(u|\mathbf{x}) = Q_q\left[(0.1 + 0.5x_2)\,\varepsilon|\mathbf{x}\right] = (0.1 + 0.5x_2)\,Q_q(\varepsilon|\mathbf{x}) = 0.1\delta_q + 0.5\delta_q x_2,$$
where the first equality is obvious, the second follows from the equivariance property (multiplication by $0.1 + 0.5x_2 > 0$ is a monotone increasing transformation, since $x_2 \geq 0$) and the last from the independence of $\varepsilon$ and $x_2$, yielding $\delta_q = Q_q(\varepsilon|\mathbf{x})$. Then, according to (14.3.1), we have
$$\partial_{x_2} Q_q(y|\mathbf{x}) = 1 + 0.5\delta_q$$
$$\partial_{x_3} Q_q(y|\mathbf{x}) = 1.$$
Therefore, we expect that the quantile regressions will yield coefficient estimates for $x_3$ close to the OLS estimate, regardless of the quantile considered, whilst for $x_2$ we will observe various discrepancies from the OLS estimate depending on the quantile. This is confirmed by the outcome of the do-file qr.do.
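The do-file qr.do is not reproduced in these notes; the following is a minimal sketch of a simulation along the same lines (sample size and seed are arbitrary choices):

* simulate the heteroskedastic DGP and compare OLS with quantile regressions
clear
set seed 10101
set obs 10000
generate x2 = rchi2(1)
generate x3 = 5*rnormal()
generate e  = 5*rnormal()
generate y  = 1 + x2 + x3 + (0.1 + 0.5*x2)*e
regress y x2 x3                 // both slopes close to 1
qreg y x2 x3, quantile(.25)     // x2 slope shifted by 0.5*Q_.25(e)
qreg y x2 x3, quantile(.75)     // x2 slope shifted by 0.5*Q_.75(e)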
Bibliography
Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms.
Econometrica 67, 251–333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel
data. Journal of Econometrics 18, 570–606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional wald tests
in iv regression with weak instruments. Journal of Econometrics 139, 116–132.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford
Bulletin of Economics and Statistics 49 (4), 431–34.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte carlo evidence
and an application to employment equations. Review of Economic Studies 58, 277–297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data
models. Journal of Econometrics 87, 115–143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models.
Economics Letters 77, 211–220.
Bruno, G. S. F., 2005a. Approximating the bias of the lsdv estimator for dynamic unbalanced
panel data models. Economics Letters 87, 361–366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473–500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in
asymptotic expansions of bias. Economics Letters 79, 145–152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering.
Journal of Business & Economic Statistics 29, 238–249.
Cameron, A. C., Trivedi, P. K., 2009. Microeconometrics using Stata. Stata Press, College
Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum
likelihood. The Stata Journal 3, 278–294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222–240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence
under juvenile and adult criminal law. European Journal of Political Economy 28, 414–429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do catholic
schools make a difference? The Quarterly Journal of Economics 110, 941–974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics
30, 1164–1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal
of Economic Education 29, 291–300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice
Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice
Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators.
Econometrica 50 (4), 1029–1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects.
Econometrica 49, 1377–1398.
Hayashi, F., 2000. Econometrics. Princeton University Press.
Judson, R. A., Owen, A. L., 1999. Estimating dynamic panel data models: a guide for macroe-
conomists. Economics Letters 65, 9–15.
Kiviet, J. F., 1995. On bias, inconsistency and efficiency of various estimators in dynamic
panel data models. Journal of Econometrics 68, 53–78.
Kiviet, J. F., 1999. Expectation of expansions for estimators in a dynamic panel data model;
some results for weakly exogenous regressors. In: Hsiao, C., Lahiri, K., Lee, L.-F., Pesaran,
M. H. (Eds.), Analysis of Panels and Limited Dependent Variable Models. Cambridge Uni-
versity Press, Cambridge, pp. 199–225.
Maddala, G. S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cam-
bridge University Press, Cambridge.
Mikusheva, A., Poi, B. P., 2006. Tests and confidence sets with correct size when instruments
are potentially weak. The Stata Journal 6, 335–347.
Moulton, B. R., 1990. An illustration of a pitfall in estimating the effects of aggregate variables
on micro units. The Review of Economics and Statistics 72 (2), 334–38.
Mundlak, Y., 1978. On the pooling of time series and cross section data. Econometrica 46,
69–85.
Nickell, S. J., 1981. Biases in dynamic models with fixed effects. Econometrica 49, 1417–1426.
Prucha, I. R., 1984. On the asymptotic efficiency of feasible aitken estimators for seemingly
unrelated regression models with error components. Econometrica 52, 203–207.
Rao, C. R., 1973. Linear Statistical Inference and Its Applications. New York: Wiley.
Roodman, D. M., 2009. How to do xtabond2: An introduction to difference and system gmm
in stata. The Stata Journal 9 (1), 86–136.
Roodman, D. M., 2011. Fitting fully observed recursive mixed-process models with cmp. The
Stata Journal 11, 159–206.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects
panel data regression. Econometrica 76, 155–74.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of
coefficients in the error components regression models. Econometrica 40 (2), 261–275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step
gmm estimators. Journal of Econometrics 126, 25–51.
Wooldridge, J. M., 2005. Unobserved heterogeneity and estimation of average partial effects.
In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification And Inference For Econometric
Models: Essays In Honor Of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition.
The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics
and Statistics 67, 134–139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple
least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092–1109.