Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Lecture 5: Missing values and SpatialProbit models
James P. LeSage
University of Toledo
Department of Economics
Toledo, OH 43606
October 24, 2005
Motivation
• County auditors need to produce estimates of market value for
unsold homes for real estate tax assessment.
• Using a sample ys of sales prices and Xs of home
characteristics, a hedonic price regression is typically used
to produce β based on the model: ys = Xsβ + us, e.g.,
�ys
yu
�=
�Xs
Xu
�β +
�εs
εu
�(1)
• Assessments are made using: yu = Xuβ.
• β = (X ′sXs)
−1X ′sys
• In this “missing completely at random”, MCAR setting
information on Xu cannot be used to improve on the estimate
β, Little (1992), Little and Rubin (1987) and Rao and
Toutenburg (1995).
• Intuitively, under independence any subset of the data should
work equally well to estimate the parameters of the model.
Knowledge of Xu becomes redundant information.
1
Spatial regression models
• We note that in this setting more appropriate models are:
SAR: y = ρWy + Xβ + ε (2)
SDM: y = ρWy + Xβ + WXoγ + ε (3)
SEM: y = Xβ + ν (4)
SEM: ν = ρWν + ε
ε ∼ N(0, σ2)
• Where:W is an n by n non-negative spatial weight matrix with
zeros on the diagonal Wij > 0 for observations j = 1, . . . , n
sufficiently close to observation i, andPn
j=1 Wij = 1.
• The scalar parameter ρ reflects the magnitude of spatial
dependence.
• The parameters to be estimated in the SAR and SEM models
are: β, σ2ε , and ρ, with the additional parameter vector γ
added to these for the SDM model.
2
• We can partition the SEM model into sold and unsold sub-
samples,
�ys
yu
�=
�Xs
Xu
�β +
�νs
νu
�(5)
�νs
νu
�= ρ
�Wss Wsu
Wus Wss
��νs
νu
�+
�εs
εu
�
• The log-likelihood for this model conditions the observed
sample ys on the missing sample observations yu, as shown
in (6).
L(β, σ2, ρ|ys) =
Zf(y|β, σ
2, ρ)dyu (6)
f(y|β, σ2, ρ) = (2πσ
2)−n/2|Γ|−1/2
exp[−(1/2σ2)e
′Γ−1
e]
Γ = σ−2
var(ν)
• Where: e = y − Xβ which can be partitioned as: es =
ys − Xsβ and eu = yu − Xuβ.
• application of the partitioning scheme to the term: e′Γ−1e
from this likelihood results in the following expression:
3
e′Γ−1
e = e′sΓ
sses + euΓ
uses + e
′sΓ
sueu + e
′uΓ
uueu (7)
• Where Γss, Γus, Γsu and Γuu represent partitioned matrices
from Γ−1.
• The expressions eu = yu − Xuβ in (7) should make it clear
that information regarding the characteristics of unsold homes
or missing values represented by Xu enters into the likelihood.
• In addition, information regarding the spatial covariance or
connectivity between sold and unsold homes captured by
Γus, Γsu contributes to the likelihood.
• Of course, estimates based on maximizing the likelihood in
(6) are not possible because sample observations on yu are
not available. However, we can replace these missing values
with their expected values conditional on the observed sample
information in the likelihood, as motivated by the integral.
Assuming normally distributed disturbances ε in this model,
the expectation of yu|ys is:
E(yu|ys) = Xuβ + ΓusΓss
(ys − Xsβ) (8)
• Since this conditional expectation consists of parameters and
observed sample information, we can replace the missing values
yu in the likelihood function. This would allow maximizing
the likelihood to produce estimates for all parameters in the
model.
4
An important point to note:
• The “full” data likelihood is:
f(ys, yu|Xs, Xu, β, σ2, ρ) = f(yu|ys, Xu, β, σ
2, ρ)
· f(ys|Xs, Xu, β, σ2, ρ)
• Where the second term on the right is the “incomplete” or
marginal data likelihood and it is available explicitly. We can
integrate out yu, and just maximize f(ys|Xs, Xu, β, σ2, ρ)
over the parameters and this will give the incomplete
data MLE’s. If prediction of yu is of interest,
E(yu|ys, Xs, Xu, β, σ2, ρ) is a parametric function of the
model parameters. Plugging in MLE’s yields the MLE of this
predictive mean.
• Improved spatial prediction has nothing to do with imputing
unobserved responses yu.
• It does however have to do with utilizing Xu in addition to
Xs.
5
The SAR model:
• The same approach can be taken for the SAR model
partitioned (with the SDM model a trivial extension).
• A difference is that the expression for E(yu|ys) in this model
takes the form shown in (9) based on standard multivariate
normal distribution theory (Poirier, 1995, page 122).
E(yu|ys) = µu + ΓusΓss
(ys − µs) (9)
µu = B21
Xuβ + B22
Xsβ
µs = B11
Xsβ + B12
Xuβ
B =
�Is − ρWss −ρWsu
−ρWus Iu − ρWuu
�(10)
• In (9), Bij denotes the i, j partition of the inverse of the
matrix B, which is a function of the parameter ρ alone. As in
the case of the SEM model, we can maximize the log-likelihood
after replacing the missing values yu with their expected values
conditional on the observed sample information, ys. From a
computational standpoint this requires that we calculate the
matrix inverse, B−1, which is an n by n matrix.
6
Implementation details
• We first consider the SEM model, where things are
computationally simpler than for the SAR model. We require
an efficient way to compute:
E(yu|ys) = Xuβ + ΓusΓss
(ys − Xsβ) (11)
• Straightforward implementation of (11) involves the inverse
matrix σ2[(In − ρW )(In − ρW ′)]−1.
• The spatial weight matrix W is sparse, as is (In − ρW ).
Unfortunately, matrix inversion results in “fill-in” of the sparse
matrix product (In − ρW )′(In − ρW ), adding to memory
storage requirements.
• As an example, the number of non-zero elements in [(In −ρW )′(In−ρW )] for a sample of 4,048 housing observations
from Lucas County Ohio was 30,748, whereas the number
in [(In − ρW )(In − ρW ′)]−1 was 1,639,838, requiring an
increase in memory of over 50 times.
7
• Gelman, Carlin, Stern and Rubin (1995, p. 479)
provide the key to efficient computation. Let Γ−1 =
(1/σ2)(I − ρW )′(I − ρW ), and form the matrix C =
(I − [diag(Γ−1)]−1Γ−1, such that the distribution of the ith
element of y, conditional on all other elements j 6= i takes
the form:
(yi|yj, j 6= i) ∼ N [µi +Xj 6=i
cij(yj −µj), (1/Γii)] (12)
• Adopting this in a way that is ideally suited to our application
based on the partitioning of observations into observed and
missing values results in:
C = In − [diag(Γ−1
)]−1
Γ−1
(13)
E(yu|ys) = µu + Cus(ys − µs)
var(yu|ys) = ι � diag(Γ−1
)uu
• Where ι denotes a vector of ones, and ι � diag(Γ−1)uu
indicates dividing ι elementwise by the vector diag(Γ−1)uu,
reflecting the main diagonal elements of the uu partition from
the inverse of Γ. The expression Cus refers to the partitioned
element of C.
8
• The key point to note here is that the expression Γ−1 =
(1/σ2ε)(In − ρW )′(In − ρW ) that appears in (13), allows
us to avoid computing the inverse.
• In addition, two of the matrices are diagonal matrices,
[diag(Γ−1)]−1, and ι � diag(Γ−1)uu, resulting in a
tremendous savings in memory, as well as speedy computation
of the matrix C.
• A related point is that computing diag(Γ−1) involves simply
computing the sum of the squared column elements of the
matrix W , multiplying these by ρ2, and adding a vector of
ones. e.g.,
diag[(In − ρW )′(In − ρW )] = diag(In + ρ
2W
′W − ρW
′ − ρW )
= diag(In + ρ2W
′W )
(This occurs because the spatial weight matrix has zeros on
the main diagonal.)
9
Alternative approaches to estimation
• Approach #1: Maximization of the log-likelihood for this
problem involves a constrained optimization problem with
parameters β, σ and ρ constrained to an interval 1/λmin <
ρ < 1/λmax, where λmin, λmax denote the minimum and
maximum eigenvalues of the spatial weight matrix W .
• Approach #2: An alternative is to construct a “repaired”
data vector y′ = [ys E(yu|ys)]′ and use existing efficient
algorithms for maximum likelihood estimation of the SAR,
SEM, SDM models in an iterative scheme, similar to an E-M
algorithm.
• Approach #3: Another alternative is to treat E(yu|ys) as
latent variables that become part of the estimation problem in
an MCMC scheme.
10
Advantages of Approach #3
1. Bayesian prior information can be introduced for the
parameters β, σ, ρ based on perhaps the observed sample
ys, Xs. This could reduce the dispersion of estimates and
prediction outcomes.
2. One need not assume a multivariate normal distribution for y,
a multivariate form of the t−distribution could be used. As
in the case of the multivariate normal distribution, conditional
distributions from the multivariate t take a multivariate t
form. Assuming y ∼ tn(µ, Σ, δ) will lead to a conditional
mean: E(yu|ys) = [µu +ΣusΣ−1ss (ys−µs)], (see Theorem
3.4.10 in Poirer (1995)).
3. Posterior measures of dispersion for the estimates and
predictions are easily obtainable from the MCMC draws.
This is in contrast to maximum likelihood estimation where
numerical hessians are required to compute measures of
dispersion.
11
Specifics of approach #2
• We have already established that it is easy to construct
E(yu|ys), which can be used to produce a repaired vector
y. This in conjunction with the log-likelihood function
concentrated with respect to β, σ:
lnL(ρ) = F + ln|In − ρW | − (n/2)ln(e′e)
e = eo − ρed
eo = y − Xβo
ed = Wy − Xβd
βo = (X′X)
−1X
′y
βd = (X′X)
−1X
′Wy (14)
Where F represents a constant not involving the parameters.
Pace and Barry (1997) demonstrate that direct sparse matrix
Cholesky algorithms can be used to compute the log-
determinant over a grid of values for the parameter ρ restricted
to the interval (−1, 1), and Barry and Pace (1999) provide
an approximation approach for doing this.
• Given this grid, a vector evaluation of the SAR log-likelihood
function over this grid of log-determinant values can be used
to find maximum likelihood estimates.
12
• Specifically,we might use a grid of q values for ρ in the interval
(1/λmin, 1/λmax)
0BB@
LnL(ρ1)LnL(ρ2)
...LnL(ρq)
1CCA ∝
0BB@
Ln|In − ρ1W |Ln|In − ρ2W |
...Ln|In − ρqW |
1CCA−(n/2)
0BB@
Ln(φ(ρ1))Ln(φ(ρ2))
...Ln(φ(ρq))
1CCA
(15)
where φ(ρi) = e′oeo − 2ρie′deo + ρ2
i e′ded.
• Having solved for the optimal ρ, conditional on the repaired
sample involving yu, the parameters β = βo − ρβd, and
σ2 = (eo−ρed)′(eo−ρed)/(n−k), can be used to produce
a new yu and another repaired sample for y. This forms the
basis for another iteration of the (E-M like) algorithm, with
the process continued until convergence.
13
Specifics of approach #3
• A formal statement of the Bayesian SAR model is shown in
(16), where we have added a normal-gamma conjugate prior
for β and σ, and a uniform prior for ρ. The prior distributions
are indicated using π.
y = ρWy + Xβ + ε (16)
ε ∼ N(0, σ2In)
π(β) ∼ N(c, T )
π(1/σ2) ∼ Γ(d, ν)
π(ρ) ∼ U [0, 1]
• To implement this estimation method, we need to determine
the conditional distributions for each parameter in our Bayesian
SAR model as well as yu, which will be treated as a latent
variable and could be viewed as additional parameters to be
estimated.
14
MCMC sampler
• begin with arbitrary β0, σ0, ρ0 and sample sequentially from:
1. p(yu|ys, β0, σ0, ρ0), shown in the paper, using the
computationally efficient approach to calculation described.
This updated value for the parameter vector yu we label
y1u.
2. p(β|σ0, ρ0, y1u), which is a multinormal distribution with
mean and variance defined in the paper. Note that we
rely on the updated value of the parameter vector y1u when
evaluating this conditional density.
3. p(σ|β1, ρ0, y1u), which is chi-squared distributed with n+
2d degrees of freedom as shown in the paper.
4. p(ρ|β1, σ1, y1u), which we sample using numerical
integration and inversion. Note also that it is easy to
implement a normal or some alternative prior distribution
for this parameter.
• We now return to step 1) employing the updated parameter
values in place of the initial values y0u, β0, σ0, ρ0. On each
pass through the sequence we collect the parameter draws
which are used to construct a joint posterior distribution for
the parameters in our model.
15
Monte Carlo experiment
Results somewhat dependent on the spatial configuration or
location of the missing versus non-missing observations.
• One way to compare accuracy is to consider the gap between
OLS prediction accuracy and the best possible prediction
accuracy associated with a benchmark model, SAR-ALL based
on using all of the data.
– OLSe = 0.30
– SAR − ALLe = 0.15
– gap = mean(|(eOLS|)− mean(|(eALL|) = 0.15
– SAR − MISSe = 0.20
– The SAR-MISS model closed 0.1/0.15 = 66 % of the
gap.
• For the case of high spatial connectivity sample the SAR-MISS
model closed 97 percent of the gap.
• For the low spatial connectivity sample the SAR-MISS closed
89.5 percent of the gap.
• Figures showing distribution of estimates for β and ρ.
16
−4 −3 −2 −1 0 1 2 3−4
−3
−2
−1
0
1
2
3
4soldunsold
Figure 1: The location of unconnected sold and unsold properties
17
0 200
0
100
200
nz = 1406
WSS
0 200
0
100
200
nz = 97
WSU
0 100 200
0
100
200
nz = 1406
WUU
Figure 2: Low spatial connectivity between sold and unsold
properties
18
−4 −3 −2 −1 0 1 2 3−4
−3
−2
−1
0
1
2
3
4soldunsold
Figure 3: The location of connected sold and unsold properties
19
0 200
0
100
200
nz = 600
WSS
0 200
0
100
200
nz = 906
WSU
0 100 200
0
100
200
nz = 594
WUU
Figure 4: High spatial connectivity between sold and unsold
properties
20
0.74 0.76 0.78 0.8 0.82 0.84 0.860
5
10
15
20
25
30
35
40
45
ρ values
Dis
trib
utio
n of
est
imat
es
SAR−EMSAR−ALL
Figure 5: Distribution of rho estimates, 1,000 trials
21
0.94 0.96 0.98 1 1.02 1.04 1.060
5
10
15
20
25
30
35
β intercept parameters
Dis
trib
utio
n of
est
imat
es
SAR−EMSAR−ALL
0.94 0.96 0.98 1 1.02 1.04 1.060
5
10
15
20
25
30
35
β slope parameters
Dis
trib
utio
n of
est
imat
es
SAR−EMSAR−ALL
Figure 6: Distribution of intercept and slope estimates, 1,000
trials
22
Similar results for actual real estate data
• Five time periods examined, using 1993 sales to predicted
1994, 1994 to predict 1995, and so on. Using around 4,000-
6,000 sales to predict 4,000-6,000 sales the following year.
• The SAR-MISS predictions closed 61 to 74 percent of the gap
between OLS and SAR-ALL accuracy, averaging around 70
percent.
• Using 5,164 observations (one of every six) to predict the
remaining 25,823 we found: 58.1 percent of the gap between
OLS and SDM accuracy closed by the SDM-MISS model.
• Using 5,164 observations (one of every six) to predict
the remaining 25,823 we found: 68.5 percent of the gap
between OLS and SDM accuracy closed by the Bayesian
heteroscedastic/robust MCMC SDM model, compared to 58.1
percent for the non-robust model.
23
Conclusion
• The issue of using widely available information on the
characteristics and location of unsold properties has not
received much attention in the literature on hedonic price
models.
• These characteristics in combination with knowledge of their
location relative to sold properties provides a large amount
of covariance information that need not be ignored simply
because the dependent variable is missing.
• Spatial estimators employing information on the unsold
properties have direct application in real estate assessment,
and such approaches have potential to materially change price
indices, and to ameliorate sample selectivity biases.
• Future work: sample selectivity bias associated with using only
sold homes.
• Future work: the impact of Bayesian prior information based
on the sold sample.
24
A spatial probit model
• A Bayesian probit model with individual effects that exhibit
spatial dependencies is set forth. Since probit models are often
used to model variation in individual choices, a model that
includes spatial interaction effects due to latent unobservables
associated with varying spatial location of the decision makers
seems useful.
Examples:
– Commuting choices made by individuals living in various
parts of a city.
– Public policy issues related to the Tiebout hypothesis such
as tax competition, welfare competition, benefit spillovers,
environmental/pollution spillovers.
– Voting choices.
25
• Individual choice is often modelled as dependent on:
– observed attributes of the choices
– observed characteristics of individuals.
Latent or unobserved attributes of the choices or characteristics
of individuals are frequently ignored. To the extent that these
latent factors are associated with the location of decision-
makers we can parsimoniously model these using a spatial
autoregression.
• The model proposed here allows for a parameter vector of
spatial interaction effects that takes the form of a spatial
autoregression. This model extends the class of Bayesian
spatial logit/probit models presented in LeSage (2000) and
relies on a hierachical construct that we estimate via Markov
Chain Monte Carlo methods.
26
Utility maximizing choices
For a binary 0, 1 choice, made by individuals k in region i
with alternatives labeled a = 0, 1:
Uik0 = γ′ωik0 + α
′0sik + θi0 + εik0
Uik1 = γ′ωik1 + α
′1sik + θi1 + εik1 (17)
Where:
ω represent observed attributes of the a = 0, 1 alternative
s represent observed attributes of individuals k
θia + εika represent unobserved properties of individuals
k, regions i or alternatives a.
We decompose the unobserved effects on utility into:
– a regional effect θia, assuming homogeneity across
individuals k in region i.
– an individualistic effect εika
The individualistic effects, εika are assumed conditionally
independent given θia, so unobserved dependencies between
individual utilities for alternative a within region i are captured
by θia.
27
Spatial autoregressive unobserved interactioneffects
Following Amemiya (1995, section 9.2) one can use
utility differences between alternatives along with the utility
maximization hypothesis to arrive at a probit regression
relationship.
zik = Uik1 − Uik0
= x′ikβ + θi + εik (18)
Our contribution is to model the the unobserved dependencies
between utility differences of individuals in separate regions
(the regional effects θi : i = 1, . . . , m) as following a
spatial autoregressive structure:
θi = ρ
mXj=1
wijθj + ui, i = 1, . . . , m
u ∼ N(0, σ2Im) (19)
Intuition here is that unobserved utility-difference aspects that
are common to individuals in a given region may be similar to
those for individuals in neighboring or nearby regions.
28
Heteroscedastic individual effects
• εik are treated as exchangeable and modeled as conditionally
iid normal variates with zero means and common variance vi,
given θi.
εi = (εik : k = 1, . . . , ni)′
εi|θi ∼ N(0, viIni)
ε|θ ∼ N(0, V )
V =
0@ v1In1
. . .
vmInm
1A
• The model in vector form:
z = Xβ + ∆θ + ε
∆ =
0@ 11
. . .
1m
1A
εi|θi ∼ N(0, viIni)
V = diag(∆v), v = (vi : i = 1, . . . , m)(20)
29
Albert and Chib (1993) latent treatment of z
Pr(Yik = 1|zik) = δ(zik > 0) (21)
Pr(Yik = 0|zik) = δ(zik ≤ 0)
• Where: δ(A) is an indicator function δ(A) = 1 for all
outcomes in which A occurs and δ(A) = 0 otherwise
• If the outcome value Y = (Yik ∈ 0, 1), then [following
Albert and Chib (1993)] these relations may be combined as
follows:
Pr(Yik = yik|zik) = δ(yik = 1)δ(zik > 0)
+ δ(yik = 0)δ(zik ≤ 0) (22)
• Which produces a conditional posterior for zik that is a
truncated normal distribution, which can be expressed as
follows:
zik|? ∼�
N(x′iβ + θi, vi) left-truncated at 0, if yi = 1
N(x′iβ + θi, vi) right-truncated at 0, if yi = 0(23)
30
Hierachical Bayesian Priors
The following prior distributions are standard [see LeSage,
(1999)]:
β ∼ N(c, T ) (24)
r/vi ∼ IDχ2(r) (25)
1/σ2 ∼ Γ(α, ν) (26)
ρ ∼ U [(λ−1min, λ
−1max)] (27)
These induce the following conditional priors:
π(θ|ρ, σ2) ∼ (σ
2)−m/2|Bρ|exp
�−
1
2σ2θ′B′ρBρθ
�
Bρ = Im − ρW (28)
π(ε|V ) ∼ |V |−1/2exp
�−
1
2ε′V−1
ε
�(29)
π(z|β, θ, V ) ∝ |V |−1/2exp
�−
1
2e′V−1
e
�(30)
e = z − Xβ −∆θ
31
Estimating the model
• Estimation will be achieved via Markov Chain Monte Carlo
methods that sample sequentially from the complete set of
conditional distributions for the parameters. To implement
the MCMC sampling approach we need to derive the complete
conditional distributions for all parameters in the model. Given
these, we proceed to sample sequential draws from these
distributions for the parameter values. Gelfand and Smith
(1990) demonstrate that MCMC sampling from the sequence
of complete conditional distributions for all parameters in the
model produces a set of estimates that converge in probability
to the true (joint) posterior distribution of the parameters.
• The complete conditional distributions for all parameters in
the model are derived in the paper.
• A few comments on innovative aspects:
32
The conditional distribution of θ:
p(θ|β, ρ, σ2, V, z, y) ∼ N(A
−10 b, A
−10 ) (31)
A0 = σ−2
B′ρBρ + ∆
′V−1
∆
b0 = ∆′V−1
(z − Xβ)
• where the mean vector is A−10 b0 and the covariance matrix
is A−10 , which involves the inverse of the mxm matrix A0
which depends on ρ. This implies that this matrix inverse
must be computed on each MCMC draw during the estimation
procedure. Typically a few thousand draws will be needed to
produce a posterior estimate of the parameter distribution
for θ, suggesting that this approach to sampling from the
conditional distribution of θ may be costly in terms of time if
m is large.
• In our illustration we rely on a sample of 3,110 US counties
and the 48 contiguous states, so that m = 48. In this
case, computing the inverse was relatively fast allowing us
to produce 15,000 draws in 285 seconds using a compiled c-
language program on a laptop with a Pentium 1.6M processor.
33
An alternative approach for large problems
• In the Appendix we provide an alternative approach that
involves only univariate normal distributions for each element
θi conditional on all other elements of θ excluding the ith
element, (θ−i). This approach is amenable to computation
for much larger sizes for m.
Specifically:
– The univariate normal density for each θi given θ−i, i =
1, . . . , m takes the form:
θi|(θ−i, β, ρ, σ2, V, z, y) ∼ N
�bi
ai
,1
ai
�(32)
ai =1
σ2+
ρ2
σ2w′.iw.i +
ni
vi
bi = φi +ρ
σ2
Xj 6=i
θj(wji + wij)−ρ2
σ2w′.iW−iθ−i
• This approach allows us to produce estimates for a problem
using nearly 60,000 US census tracts with 3,000 counties as
the regions. Time required to produce 10,000 draws is around
45 minutes using a compiled c-language program on a laptop
with a Pentium 1.6M processor.
34
The conditional distribution of ρ:
p(ρ|?) ∝ |Bρ|exp�−
1
2σ2θ′(Im − ρW )
′(Im − ρW )θ
�
(33)
• where ρ ∈ (λ−1min, λ−1
max). As noted in LeSage (2000) this is
not reducible to a standard distribution, so we might adopt
a M-H step during the MCMC sampling procedures. LeSage
(1999) suggests a normal or t− distribution be used as a
transition kernel in the M-H step.
• Another approach that we are experimenting with for this
model is to rely on univariate numerical integration to obtain
the the conditional posterior density of ρ, and then produce
a draw by inversion. This requires integration on every trip
through the sampler, but is speedy using vectorization.
35
Special cases of the model
• The homoscedastic case, where individual variances are
assumed equal across all regions, so the regional variance
vector, v reduces to a scalar
• The individual spatial-dependency case where individuals
are treated as ‘regions’ denoted by the index i.. In this
case we are essentially setting m = n and ni = 1 for all
i = 1, . . . , m.
• Note that although one could in principle consider
heteroscedastic effects among individuals, the existence of
a single observation per individual renders estimation of such
variances problematic at best.
36