SEM with Missing Data and Unknown Population Distributions Using Two-stage ML:
Theory and Its Application∗
Ke-Hai Yuan and Laura Lu
University of Notre Dame
January 14, 2008
Revised July 24, 2008
Second revision, August 7, 2008
∗The research was supported by NSF grant DMS04-37167 and Grants DA00017 and DA01070
from the National Institute on Drug Abuse. Correspondence concerning this article: Ke-Hai
Yuan, Department of Psychology, University of Notre Dame, Notre Dame, IN 46556, USA
Abstract
This paper provides the theory and application of the two-stage ML procedure for struc-
tural equation modeling (SEM) with missing data. The validity of this procedure does not
require the assumption of a normally distributed population. When the population is nor-
mally distributed and all missing data are missing at random (MAR), the direct maximum
likelihood (ML) procedure is nearly optimal for SEM with missing data. When missing data
mechanisms are unknown, including auxiliary variables in the analysis will make the missing
data mechanism more likely to be MAR. It is much easier to include auxiliary variables in
the two-stage ML than in the direct ML. Based on the most recent developments for missing
data with an unknown population distribution, the paper first provides a minimally technical
account of why the normal distribution-based ML generates consistent parameter estimates
when the missing data mechanism is MAR. The paper also provides sufficient conditions for
the two-stage ML to be a valid statistical procedure in the general case. For the application
of the two-stage ML, a SAS IML program is given to perform the first-stage analysis and
EQS codes are provided to perform the second-stage analysis. An example with open/closed
book examination data is used to illustrate the application of the provided programs. One
aim is for quantitative graduate students/applied psychometricians to understand the tech-
nical details for missing data analysis. Another aim is for applied researchers to use the
method properly.
Keywords: Asymptotic normality, auxiliary variables, consistency, missing at random, sandwich-
type covariance matrix.
1. Introduction
In the social and behavioral sciences, data are typically collected using surveys or
questionnaires. Missing data cannot be avoided in the process of data collection, especially
in studies of a longitudinal nature. Because samples with missing values lose the balanced
structure of their complete counterparts, special methods have to be developed for their
analysis. The method of analysis is closely related to the reason for the missingness, or the
missing data mechanism. Rubin (1976) and Little and Rubin (2002, pp. 11–19) formally
defined three missing data mechanisms. Missing completely at random (MCAR) is a process
in which missingness is independent of both the observed and the missing values; missing
at random (MAR) is a process in which missingness depends on the observed values, not
the values being missed; missing not at random (MNAR) is a process in which missingness
depends on the values being missed. The missing data with MCAR and MAR mechanisms
are also referred to as ignorable nonresponses, because maximum likelihood estimates (MLE)
preserve many of their properties if these mechanisms are ignored.
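As a rough illustration of the three mechanisms (in Python rather than the paper's SAS IML; the cutoffs, rates, and regression slope below are arbitrary choices for the sketch, not from the paper), one can generate a bivariate sample in which $x_1$ is always observed and $x_2$ is deleted under each mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x1 = rng.normal(size=N)                 # always observed
x2 = 0.5 * x1 + rng.normal(size=N)      # subject to missingness

# MCAR: missingness ignores both the observed and the missing values.
miss_mcar = rng.random(N) < 0.3
# MAR: missingness depends only on the always-observed x1.
miss_mar = x1 > 0.5
# MNAR: missingness depends on the value of x2 itself.
miss_mnar = x2 > 0.5

x2_mar = np.where(miss_mar, np.nan, x2)  # MAR-incomplete version of x2
```

Under MAR, the probability that $x_2$ is missing is fully determined by the observed $x_1$, which is what allows likelihood-based methods to ignore the mechanism.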
Structural equation modeling (SEM) is regularly used for analyzing survey data, where
incomplete or missing data occur regularly. Many procedures have been developed for SEM
with missing data. Two noted ones are the direct maximum likelihood (ML) and the two-
stage ML using the normal distribution assumption (e.g., Jamshidian & Bentler, 1999; Savalei
& Bentler, 2007). In the direct ML, the MLEs of the structural parameters are obtained by
directly maximizing the likelihood function. For the two-stage ML, MLEs of the means and
covariances of the saturated model are obtained in the first stage, parallel to the direct ML.
The MLEs of the saturated model are further fitted by the structural model in the second
stage, using the normal distribution based discrepancy function. When the population is not
normally distributed, previous literature suggests that all missing values need to be MCAR
in order for the normal distribution based MLE to be consistent (e.g., Laird, 1988; Rotnitzky
& Wypij, 1994; Yuan & Bentler, 2000). However, recent development indicates that normal
distribution based MLE is still consistent under the MAR mechanism and nonnormally
distributed population (Yuan, 2007). One of the purposes of the paper is to introduce this
result using the simple bivariate case and provide the program setup for applying it to
SEM using the two-stage ML. We will show how to get consistent standard errors (SEs) for
parameter estimates and valid statistics for overall model evaluation.
When missing values are MNAR, one generally has to model the missing data mechanism
in order to obtain consistent parameter estimates. However, missing data mechanisms de-
pend on whether variables related to missingness are observed and included in the analysis.
In practice, a researcher often has no difficulty in identifying a set of variables of interest
to test a theory using SEM. However, it is always difficult to verify the MAR mechanism. Schafer
and Olsen (1998), Collins, Schafer and Kam (2001), and Graham (2003) suggested that one
should include as many variables as possible to maximize the chance that the missing data mechanism is MAR.
One can use the two-stage ML to implement the suggestion. In the first stage, the normal
distribution based MLEs are obtained for the means and covariances of all the variables. In
the second stage, only estimates of means and covariances corresponding to the variables of
interest are selected and fitted by the structural model. The variables that participate in
the first-stage estimation but not in the second-stage analysis are called auxiliary variables.
A recent study by Savalei and Bentler (2007), with normally distributed data, showed that
the two-stage ML is nearly optimal when including the auxiliary variables. In this paper, we
will study the two-stage ML when a sample comes from an unknown population distribution
and missing values are MAR, aiming to develop a valid procedure for SEM with typically
nonnormally distributed samples in practice (Micceri, 1989). We will present the inference
problem for SEM with missing data in the same form as that for complete data from an unknown
population distribution. Once the two are unified, many procedures developed for complete data can be
equally applied to the two-stage ML with missing data.
Although there are many studies on missing data in the SEM literature, few facilitate a
good understanding of the problem and result, especially when the population is nonnormally
distributed and the missing data mechanism is MAR. For a sample with missing data from
an unknown population distribution, the results obtained by Rubin (1976) do not apply to
the direct or two-stage ML due to working with a wrong likelihood function. Arminger and
Sobel (1990) proposed to apply the so-called pseudo ML to SEM with missing data, but
they did not show why or when the pseudo ML, developed by White (1982) and Gourieroux,
Monfort and Trognon (1984) for complete data, can be applied to missing data. Actually,
examples exist in which the pseudo ML fails when data are nonnormally distributed and the
missing data mechanism is MAR; one will be given in section 3. Yuan (2007) provided the
conditions for normal distribution based ML to apply to a nonnormal population distribution
with MAR data, but the paper mainly dealt with saturated means and covariances, not SEM
in particular. The development is also beyond the level of most quantitative psychologists.
Actually, with missing data, even for a normally distributed population, little literature
in SEM exists that facilitates thorough understanding of issues related to proper statistical
inference; almost all are about how to obtain the MLEs and the likelihood ratio (LR) statistic.
To fill the gap, the paper will provide the technical details for inference with missing data
using a simple scenario. The prerequisite knowledge is basic calculus, linear algebra, and
a graduate course in statistics/probability. We aim to let quantitative students in social
sciences fully understand the components of missing data inference: consistency,
asymptotic normality, and the so-called sandwich-type covariance matrix. We expect that
the material will be used as a teaching tool in SEM courses for graduate students in the
social and behavioral sciences. After obtaining the results in the simplest case, we will relate
them to parallel results in higher dimensions. Such a connection will allow readers to better
understand the limitation and strength of missing data methodology. We hope that, with
a better understanding of the problem, researchers will be able to choose the most proper
methodology for SEM with missing data instead of merely relying on the default output of
a software package.
Even though the two-stage ML has a lot to offer for SEM with missing data from typically
unknown population distributions in practice, standard software does not facilitate
its application. For example, EQS and Mplus provide the MLEs and their SEs based on a
sandwich-type covariance matrix. But their outputs do not contain the whole sandwich-type
covariance matrix, which is a necessary component for the second-stage analysis. Another
contribution of this paper is to introduce a SAS IML program for the first-stage analysis.
For any missing data set with auxiliary variables, by modifying three lines of the provided
statements, the program will generate the necessary elements for mean and covariance structure
analysis in the second stage. With these elements, SEM with missing data is essentially the
same as SEM with complete data for any SEM programs that allow users to input the first
four moments for analysis.
In section 2 of the paper, we show the consistency and asymptotic normality of the MLE
using the simple bivariate case. Section 3 provides the components for proper two-stage
SEM using the normal distribution based estimates of means and covariances. Section 4
introduces SAS and EQS programs for implementing the two-stage ML. An example using
the test score data of Mardia, Kent and Bibby (1979) is used to illustrate the application of
these programs. Discussion and conclusion are provided in section 5.
2. Consistency, Asymptotic Normality and Sandwich-type Covariance Matrix in a Simple
Bivariate Case
Let $x = (x_1, x_2)'$ be a bivariate population with mean $\mu = (\mu_1, \mu_2)'$ and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$
where the distribution of $x$ is unknown. For simplicity only, we assume that $\Sigma$ is known.
Suppose a sample from $x$ is obtained as
$$x_{11}, \ldots, x_{n1}, x_{(n+1)1}, \ldots, x_{N1}$$
$$x_{12}, \ldots, x_{n2}, \qquad\qquad (1)$$
where the first variable is observed on all the cases while the second variable is observed only
on the first n cases. We will use this sample to show that, under very general conditions, the
normal distribution based MLE is consistent and asymptotically normally distributed, and
that the asymptotic covariance matrix of the MLE is consistently estimated by a sandwich-
type covariance matrix. In the following we will first obtain the normal distribution based
MLEs of µ1 and µ2. Then we will study the properties of the MLEs.
Let $x_i = (x_{i1}, x_{i2})'$ for $i = 1, 2, \ldots, n$; $\sigma_1 = \sigma_{11}^{1/2}$, $\sigma_2 = \sigma_{22}^{1/2}$, and $\rho = \sigma_{12}/(\sigma_1\sigma_2)$. After
omitting a constant, the normal distribution-based log likelihood function for the sample in
(1) is given by
$$l(\mu) = \sum_{i=1}^{N} l_i(\mu),$$
where
$$l_i(\mu) = -\frac{1}{2}(x_i - \mu)'\Sigma^{-1}(x_i - \mu) = -\frac{1}{2(1-\rho^2)}\left[\frac{(x_{i1}-\mu_1)^2}{\sigma_{11}} - \frac{2\rho(x_{i1}-\mu_1)(x_{i2}-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_{i2}-\mu_2)^2}{\sigma_{22}}\right],$$
$$i = 1, 2, \ldots, n; \qquad (2a)$$
$$l_i(\mu) = -\frac{1}{2\sigma_{11}}(x_{i1}-\mu_1)^2, \quad i = n+1, n+2, \ldots, N. \qquad (2b)$$
Notice that l(µ) is a quadratic function of µ1 and µ2. Setting the derivatives of l(µ) with
respect to µ1 and µ2 equal to zero leads to two linear equations for µ1 and µ2. Let
$$\bar{x}_1 = \frac{1}{N}\sum_{i=1}^{N} x_{i1}, \qquad \bar{x}_{1*} = \frac{1}{n}\sum_{i=1}^{n} x_{i1}, \qquad \bar{x}_{2*} = \frac{1}{n}\sum_{i=1}^{n} x_{i2}.$$
The MLEs
$$\hat\mu_1 = \bar{x}_1 \quad \text{and} \quad \hat\mu_2 = \bar{x}_{2*} + b(\bar{x}_1 - \bar{x}_{1*}) \qquad (3)$$
are obtained by directly solving these equations, where $b = \sigma_{12}/\sigma_{11}$ is the slope of $x_2$ regressed on $x_1$.
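The closed-form estimators in (3) are easy to compute directly. Below is a small Python sketch (not the paper's SAS IML program); the population values $\mu = (0, 1)'$, $\rho = 0.6$, and the cutoff $c = 0$ are arbitrary choices for the illustration:

```python
import numpy as np

def two_stage_mle(x1_all, x2_obs, sigma11, sigma12):
    """Normal-theory MLEs in (3): x1 is observed for all N cases,
    x2 only for the first n cases; Sigma is treated as known."""
    n = len(x2_obs)
    b = sigma12 / sigma11              # slope of x2 regressed on x1
    xbar1 = x1_all.mean()              # mean over all N cases
    xbar1_star = x1_all[:n].mean()     # mean over the n complete cases
    xbar2_star = x2_obs.mean()
    mu1_hat = xbar1
    mu2_hat = xbar2_star + b * (xbar1 - xbar1_star)
    return mu1_hat, mu2_hat

# Illustrative MAR data from model (4a)/(5): x2 observed only when z1 <= 0.
rng = np.random.default_rng(1)
N = 200_000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x1 = z1                                # mu1 = 0, sigma1 = 1
x2 = 1.0 + 0.6 * z1 + 0.8 * z2         # mu2 = 1, rho = 0.6
w = z1 <= 0.0                          # missingness indicator, c = 0
x1_all = np.concatenate([x1[w], x1[~w]])  # complete cases first
mu1_hat, mu2_hat = two_stage_mle(x1_all, x2[w], 1.0, 0.6)
```

The complete-case mean $\bar x_{2*}$ alone is badly biased under this selection (its expectation is about 0.52 rather than 1), while $\hat\mu_2$ corrects the bias through the regression term $b(\bar x_1 - \bar x_{1*})$.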
We will next show that $\hat\mu_2$ is consistent and $(\hat\mu_1, \hat\mu_2)$ are jointly asymptotically normally
distributed when the missing values in (1) are MAR. For the sample in (1), we need to have
a data model to introduce the MAR mechanism. Let z1 and z2 be independent random
variables, each having a mean 0 and variance 1.0. We may denote them as z1 ∼ F1(t) and
z2 ∼ F2(t) with F1 and F2 being two arbitrary cumulative distribution functions (CDF). A
widely used population model for bivariate random variables is (e.g., Hoffman, 1959; Lee &
Rodgers, 1998; Little & Rubin, 2002, p. 23)
$$x_1 = \mu_1 + \sigma_1 z_1, \qquad x_2 = \mu_2 + \sigma_2[\rho z_1 + (1-\rho^2)^{1/2} z_2]; \qquad (4a)$$
or
$$x = \mu + Lz, \qquad (4b)$$
where
$$L = \begin{pmatrix} \sigma_1 & 0 \\ \sigma_2\rho & \sigma_2(1-\rho^2)^{1/2} \end{pmatrix}$$
and $z = (z_1, z_2)'$. Let the sample in (1) be generated according to
$$w = \begin{cases} 1 & \text{if } z_1 \le c, \\ 0 & \text{if } z_1 > c, \end{cases} \qquad (5)$$
where w = 1 indicates that x2 is observed and w = 0 indicates that x2 is missing. Since
$z_1 > c$ is equivalent to $x_1 > \mu_1 + \sigma_1 c$, the missing values within the sample in (1) are MAR.
Corresponding to the observed and missing data are two random vectors $z_* = (z_{1*}, z_{2*})'$ and
$z^* = (z_1^*, z_2^*)'$; that is, $z_* \sim (z \mid w = 1)$ and $z^* \sim (z \mid w = 0)$. These conditional random vectors
will facilitate obtaining the consistency of the MLEs and the sandwich-type covariance matrix
using the law of large numbers, and the asymptotic normality of the MLEs using the central limit
theorem. Since $z_1$ and $z_2$ are independent, so are $w$ and $z_2$. Consequently, $z_{2*} = (z_2 \mid w = 1) = z_2 \sim F_2(t)$, $z_2^* = (z_2 \mid w = 0) = z_2 \sim F_2(t)$, and $z_{2*}$ and $z_2^*$ are independent of $z_{1*}$ and $z_1^*$.
Let $(z_i, w_i)$, $i = 1, 2, \ldots, N$, be a random sample from $(z, w)$ through which the sample
in (1) is created. Corresponding to the sample $(z_i, w_i)$ are independent random vectors $z_{i*} = (z_{i1*}, z_{i2*})'$ from the population $z_*$ and $z_i^* = (z_{i1}^*, z_{i2}^*)'$ from the population $z^*$. Obviously,
$n = \sum_{i=1}^{N} w_i$ is random and follows the binomial distribution $B(N, F_1(c))$. Not so obvious
is that, for a continuous function $g(t_1, t_2)$, $\sum_{i=1}^{n} g(z_{i1}, z_{i2})$ and $\sum_{i=1}^{N} w_i g(z_{i1*}, z_{i2*})$ have the
same distribution; $\sum_{i=n+1}^{N} g(z_{i1}, z_{i2})$ and $\sum_{i=1}^{N} (1-w_i) g(z_{i1}^*, z_{i2}^*)$ have the same distribution.
These relationships will also be used to study the consistency and asymptotic normality of
$\hat\mu_2$. A proof is provided in appendix A for readers who are interested in the details behind
these equalities in distribution.
2.1 Consistency
Obviously, $\hat\mu_1 = \bar x_1$ is consistent according to the law of large numbers. So we only need
to show the consistency of $\hat\mu_2$. We can write the $\hat\mu_2$ in (3) as
$$\hat\mu_2 = b\bar x_1 + (\bar x_{2*} - b\bar x_{1*}). \qquad (6)$$
Because $\bar x_1$ is consistent for $\mu_1$, we only need to find the probability limit of the second
term on the right side of (6). Using (4a), we have
$$\begin{aligned}
\bar x_{2*} - b\bar x_{1*} &= (\mu_2 - b\mu_1) + \frac{1}{n}\sum_{i=1}^{n}\bigl\{\sigma_2[\rho z_{i1} + (1-\rho^2)^{1/2} z_{i2}] - b\sigma_1 z_{i1}\bigr\} \\
&= (\mu_2 - b\mu_1) + \frac{\sigma_2(1-\rho^2)^{1/2}}{n}\sum_{i=1}^{n} z_{i2} \\
&= (\mu_2 - b\mu_1) + \frac{\sigma_2(1-\rho^2)^{1/2}}{\sum_{i=1}^{N} w_i/N}\cdot\frac{1}{N}\sum_{i=1}^{N} w_i z_{i2*},
\end{aligned} \qquad (7)$$
where the second equality is due to $\sigma_2\rho = b\sigma_1$. Notice that $w_i$ and $z_{i2*}$ are independent
with $E(w_i) = F_1(c)$ and $E(z_{i2*}) = 0$. According to the law of large numbers,
$$\frac{1}{N}\sum_{i=1}^{N} w_i \xrightarrow{P} F_1(c) \quad\text{and}\quad \frac{1}{N}\sum_{i=1}^{N} w_i z_{i2*} \xrightarrow{P} 0,$$
where $\xrightarrow{P}$ is the notation for convergence in probability. The consistency of $\hat\mu_2$ follows from
(6), (7) and $\bar x_1 \xrightarrow{P} \mu_1$.
2.2 Asymptotic normality
We will use the central limit theorem to show that $\hat\mu_1$ and $\hat\mu_2$ are jointly asymptotically
normally distributed. Let $\hat\mu = (\hat\mu_1, \hat\mu_2)'$; we will need to relate $\hat\mu - \mu$ to $z_{i1}$, $w_i$ and $z_{i2*}$ in order
to apply the central limit theorem for independent and identically distributed observations.
It follows from (4a) that
$$\bar x_1 = \mu_1 + \frac{\sigma_1}{N}\sum_{i=1}^{N} z_{i1}. \qquad (8)$$
Combining (3), (7) and (8) yields
$$\hat\mu - \mu = H_N \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} z_{i1} \\ w_i z_{i2*} \end{pmatrix}, \qquad (9)$$
where
$$H_N = \begin{pmatrix} \sigma_1 & 0 \\ b\sigma_1 & \left(\sum_{i=1}^{N} w_i/N\right)^{-1}\sigma_2(1-\rho^2)^{1/2} \end{pmatrix}.$$
Notice that the vectors $(z_{i1}, w_i z_{i2*})'$, $i = 1, 2, \ldots, N$, are independent and identically
distributed with mean $(0, 0)'$. It follows from the central limit theorem that
$$\frac{1}{\sqrt N}\sum_{i=1}^{N} \begin{pmatrix} z_{i1} \\ w_i z_{i2*} \end{pmatrix}$$
converges to a bivariate normal distribution with mean $(0, 0)'$ and variance-covariance matrix
$$\Pi = E\begin{pmatrix} z_1^2 & z_1 w z_{2*} \\ w z_{2*} z_1 & w^2 z_{2*}^2 \end{pmatrix}.$$
Recall that $w$ is either 0 or 1 with $E(w) = E(w^2) = F_1(c)$, and $w$ is independent of
$z_{2*} \sim F_2(t)$. Thus,
$$\Pi = \begin{pmatrix} 1 & 0 \\ 0 & F_1(c) \end{pmatrix}.$$
It is easy to see that
$$H_N \xrightarrow{P} H = \begin{pmatrix} \sigma_1 & 0 \\ b\sigma_1 & \sigma_2(1-\rho^2)^{1/2}/F_1(c) \end{pmatrix}.$$
It follows from (9) and the well-known Slutsky theorem$^1$ (e.g., Ferguson, 1996, pp. 39–42)
that
$$\sqrt N(\hat\mu - \mu) \xrightarrow{L} N(0, \Omega), \qquad (10a)$$
where $\xrightarrow{L}$ denotes convergence in distribution and
$$\Omega = H\Pi H' = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22}\{1 + [1/F_1(c) - 1](1-\rho^2)\} \end{pmatrix}. \qquad (10b)$$
The asymptotic normality of the MLE has been established.
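A quick Monte Carlo sketch (in Python; the values $\rho = 0.6$, $c = 0$, and a standardized exponential for $z_2$ are arbitrary choices, not from the paper) can check the $(2,2)$ element of $\Omega$ in (10b). With $F_1(c) = 0.5$ and $\sigma_{22} = 1$, the theory predicts $\mathrm{Var}[\sqrt N(\hat\mu_2 - \mu_2)] \to 1 + [1/0.5 - 1](1 - 0.6^2) = 1.64$ even though $z_2$ is nonnormal:

```python
import numpy as np

# Monte Carlo check of (10b): rho = 0.6, c = 0 gives F1(c) = 0.5 and
# Omega_22 = sigma22 * {1 + [1/F1(c) - 1](1 - rho^2)} = 1.64 for sigma22 = 1.
rng = np.random.default_rng(2)
R, N, rho, c = 2000, 500, 0.6, 0.0
stats = []
for _ in range(R):
    z1 = rng.normal(size=N)
    z2 = rng.exponential(size=N) - 1.0        # mean 0, variance 1, skewed
    x1 = z1                                   # mu1 = 0, sigma1 = 1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2  # mu2 = 0, sigma2 = 1
    w = z1 <= c                               # MAR selection as in (5)
    mu2_hat = x2[w].mean() + rho * (x1.mean() - x1[w].mean())  # (3), b = rho
    stats.append(np.sqrt(N) * mu2_hat)
emp_var = np.var(stats)
theory_var = 1 + (1 / 0.5 - 1) * (1 - rho**2)  # = 1.64 from (10b)
```

The empirical variance of $\sqrt N \hat\mu_2$ should land close to 1.64, illustrating that (10b) does not require $F_2$ to be normal.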
2.3 Sandwich-type covariance matrix
The Ω in (10) can be estimated by the so-called sandwich-type covariance matrix. Ac-
tually, only in the simplest cases can we estimate Ω directly, as in (10b) when replacing
F1(c) by n/N . The sandwich-type covariance matrix allows us to easily obtain a consistent
estimator for Ω in the general case.
The sandwich-type covariance matrix involves the first and second derivatives of the log
likelihood function with respect to the unknown parameters. For the $l_i(\mu)$ in (2), these are
given by the $2 \times 1$ vector $\dot l_i(\mu) = \partial l_i(\mu)/\partial\mu$ and the $2 \times 2$ matrix $\ddot l_i(\mu) = \partial^2 l_i(\mu)/\partial\mu\partial\mu'$.
Let
$$A_N = \frac{1}{N}\sum_{i=1}^{N} \ddot l_i(\mu) \quad\text{and}\quad B_N = \frac{1}{N}\sum_{i=1}^{N} \dot l_i(\mu)\dot l_i'(\mu). \qquad (11a)$$
The sandwich-type covariance matrix is given by
$$\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}, \qquad (11b)$$
where $\hat A_N$ and $\hat B_N$ are obtained when the $\mu$ in (11a) is replaced by $\hat\mu$.
Because the proof of the consistency of $\hat\Omega_{SWN}$ involves a lot of notation and algebraic
operations, we put it in appendix B, where essentially the same technique is used as for
showing the consistency of $\hat\mu$.
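The construction in (11) is mechanical once the scores and Hessians of (2a) and (2b) are written out. Below is a Python sketch for the bivariate example (population values $\rho = 0.6$, $\mu = (0,1)'$, $c = 0$ in the test are arbitrary choices for the illustration):

```python
import numpy as np

def sandwich_bivariate(x1_all, x2_obs, mu, Sigma):
    """Sandwich estimator (11) for the bivariate example with Sigma known:
    A_N averages the Hessians of (2a)/(2b), B_N the outer products of the
    scores, both evaluated at the MLE mu."""
    n, N = len(x2_obs), len(x1_all)
    Sinv = np.linalg.inv(Sigma)
    A = np.zeros((2, 2))
    B = np.zeros((2, 2))
    for i in range(N):
        if i < n:    # complete case, from (2a)
            r = np.array([x1_all[i] - mu[0], x2_obs[i] - mu[1]])
            score = Sinv @ r
            hess = -Sinv
        else:        # x2 missing, from (2b)
            score = np.array([(x1_all[i] - mu[0]) / Sigma[0, 0], 0.0])
            hess = np.array([[-1.0 / Sigma[0, 0], 0.0], [0.0, 0.0]])
        A += hess
        B += np.outer(score, score)
    A /= N
    B /= N
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv   # estimates Omega in (10a)
```

For MAR data generated from (4a) and (5), the output should approach the $\Omega$ in (10b), e.g., $\hat\Omega_{22} \approx \sigma_{22}\{1 + [1/F_1(c) - 1](1-\rho^2)\}$.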
We have assumed $\Sigma$ known throughout this section for simplicity. Parallel results also
hold when $\Sigma$ is estimated. In particular, for the observed sample in (1), analytical formulas
for the MLEs $\hat\mu_1$, $\hat\mu_2$, $\hat\sigma_{11}$, $\hat\sigma_{12}$, and $\hat\sigma_{22}$ are provided in Anderson (1957). Under the population
model (4) and (5), and the condition that $z_1 \sim F_1(t)$ and $z_2 \sim F_2(t)$ have finite fourth-order
moments, the MLEs are consistent and asymptotically normally distributed. The asymptotic
covariance matrix is consistently estimated by a sandwich-type covariance matrix parallel to
(11).

$^1$The theorem states that, if $x_n \xrightarrow{L} x$, $a_n$ converges in probability to $a$ and $b_n$ converges in probability to $b$, then $a_n x_n + b_n \xrightarrow{L} ax + b$.
3. Missing Data in the General Case and Inference in SEM
Let $x$ be a $p$-variate population with $E(x) = \mu$ and $\mathrm{Cov}(x) = \Sigma$. Let $z_1, z_2, \ldots, z_p$ be
independent random variables with mean 0, variance 1, and finite fourth-order moments;
L be a lower triangular matrix such that LL′ = Σ. Parallel to (4), a population model for
x is given by
$$x = \mu + Lz, \qquad (12)$$
where z = (z1, z2, . . . , zp)′. Because the distributions of z1 to zp are arbitrary other than
subject to finite fourth-order moments, (12) includes an infinite number of distributions. A
normally distributed x corresponds to z ∼ Np(0, I). Let x1, x2, . . ., xN denote the observed
vectors for a sample from x with missing values. Corresponding to each xi is a vector zi =
$(z_{i1}, z_{i2}, \ldots, z_{ip})'$. Suppose $x_{ij_1}, x_{ij_2}, \ldots, x_{ij_k}$ are missing when each of $z_{il_1}, z_{il_2}, \ldots, z_{il_m}$ falls
into certain intervals. Such selected observations are widely used in the econometrics
literature to model human behavior (e.g., Amemiya, 1973; Heckman, 1979; Tobin, 1958),
where the distribution of $x$ is known and the missing data mechanism is modeled. Here,
we do not know the distribution of $x$; because we consider the MAR mechanism, we will
not need to explicitly model it either. Let $l = \max(l_1, l_2, \ldots, l_m)$ and $L_l$ be the upper-left
$l \times l$ submatrix of $L$; then $(x_{i1}, x_{i2}, \ldots, x_{il})' = L_l(z_{i1}, z_{i2}, \ldots, z_{il})'$. Although $z$ is a
latent vector, all the information regarding $(z_{i1}, z_{i2}, \ldots, z_{il})'$ is contained in $(x_{i1}, x_{i2}, \ldots, x_{il})'$.
When $l < \min(j_1, j_2, \ldots, j_k)$ and $(x_{i1}, x_{i2}, \ldots, x_{il})'$ is observed, all the information related to
the missing values is observed, and thus the missing data mechanism is MAR (Rubin, 1976). When
$l \ge \min(j_1, j_2, \ldots, j_k)$ or any variables in $(x_{i1}, x_{i2}, \ldots, x_{il})'$ are missing, the probability of
missingness is related to the values of the variables being missed and the missing data
mechanism is MNAR.
Under (12) and the MAR mechanism described above, Yuan (2007) showed that the
normal distribution-based MLEs of $\mu$ and $\Sigma$ are still consistent and asymptotically normally
distributed. Because $\Sigma$ is symmetric, we let $\sigma = \mathrm{vech}(\Sigma)$ be the vector obtained by stacking the
columns of the lower-triangular portion of $\Sigma$. Let $\hat\beta = (\hat\mu', \hat\sigma')'$ be the MLE of $\beta = (\mu', \sigma')'$.
Parallel to (10) and (11), the asymptotic distribution of $\hat\beta$ is characterized by
$$\sqrt N(\hat\beta - \beta) \xrightarrow{L} N(0, \Omega_{SW}), \qquad (13a)$$
where $\Omega_{SW}$ can be consistently estimated by
$$\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}. \qquad (13b)$$
Specifically, after omitting a constant, the log likelihood function based on $x_i \sim N_{p_i}(\mu_i, \Sigma_i)$
is
$$l(\mu, \Sigma) = \sum_{i=1}^{N} l_i(\mu, \Sigma),$$
where $p_i$ is the number of observed variables in $x_i$ and
$$l_i(\mu, \Sigma) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x_i - \mu_i)'\Sigma_i^{-1}(x_i - \mu_i), \quad i = 1, 2, \ldots, N.$$
Each $\mu_i$ is a subvector of $\mu$ and each $\Sigma_i$ is a submatrix of $\Sigma$, corresponding to the observed
variables in $x_i$. Let $\mathrm{vec}(\Sigma_i)$ be the $p_i^2 \times 1$ vector obtained by stacking the columns of $\Sigma_i$,
$D_{p_i} = \partial\mathrm{vec}(\Sigma_i)/\partial\sigma_i'$, $\dot\sigma_i = \partial\sigma_i/\partial\sigma'$, $\dot\mu_i = \partial\mu_i/\partial\mu'$, and $\otimes$ be the notation for the so-called Kronecker
product (see e.g., Bollen, 1989, p. 465). Let
$$W_i = \frac{1}{2} D_{p_i}'(\Sigma_i^{-1} \otimes \Sigma_i^{-1}) D_{p_i},$$
and
$$\dot l_{i\mu}(\mu, \Sigma) = \partial l_i(\mu, \Sigma)/\partial\mu, \qquad \dot l_{i\sigma}(\mu, \Sigma) = \partial l_i(\mu, \Sigma)/\partial\sigma,$$
$$\ddot l_{i\mu\mu}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\mu\partial\mu', \qquad \ddot l_{i\mu\sigma}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\mu\partial\sigma',$$
$$\ddot l_{i\sigma\mu}(\mu, \Sigma) = \ddot l_{i\mu\sigma}'(\mu, \Sigma), \qquad \ddot l_{i\sigma\sigma}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\sigma\partial\sigma'.$$
Then
$$\dot l_{i\mu}(\mu, \Sigma) = \dot\mu_i'\Sigma_i^{-1}(x_i - \mu_i), \qquad \dot l_{i\sigma}(\mu, \Sigma) = \dot\sigma_i' W_i \mathrm{vech}[(x_i - \mu_i)(x_i - \mu_i)' - \Sigma_i], \qquad (14)$$
$$\ddot l_{i\mu\mu}(\mu, \Sigma) = -\dot\mu_i'\Sigma_i^{-1}\dot\mu_i, \qquad \ddot l_{i\mu\sigma}(\mu, \Sigma) = -\dot\mu_i'\{\Sigma_i^{-1} \otimes [(x_i - \mu_i)'\Sigma_i^{-1}]\} D_{p_i}\dot\sigma_i, \qquad (15a)$$
$$\ddot l_{i\sigma\sigma}(\mu, \Sigma) = -\dot\sigma_i' D_{p_i}'\Bigl\{\Bigl[\Sigma_i^{-1}(x_i - \mu_i)(x_i - \mu_i)'\Sigma_i^{-1} - \frac{1}{2}\Sigma_i^{-1}\Bigr] \otimes \Sigma_i^{-1}\Bigr\} D_{p_i}\dot\sigma_i. \qquad (15b)$$
The matrices $\hat A_N$ and $\hat B_N$ in (13) are evaluated as
$$\hat A_N = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} \ddot l_{i\mu\mu}(\hat\mu, \hat\Sigma) & \ddot l_{i\mu\sigma}(\hat\mu, \hat\Sigma) \\ \ddot l_{i\mu\sigma}'(\hat\mu, \hat\Sigma) & \ddot l_{i\sigma\sigma}(\hat\mu, \hat\Sigma) \end{pmatrix}$$
and
$$\hat B_N = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} \dot l_{i\mu}(\hat\mu, \hat\Sigma)\dot l_{i\mu}'(\hat\mu, \hat\Sigma) & \dot l_{i\mu}(\hat\mu, \hat\Sigma)\dot l_{i\sigma}'(\hat\mu, \hat\Sigma) \\ \dot l_{i\sigma}(\hat\mu, \hat\Sigma)\dot l_{i\mu}'(\hat\mu, \hat\Sigma) & \dot l_{i\sigma}(\hat\mu, \hat\Sigma)\dot l_{i\sigma}'(\hat\mu, \hat\Sigma) \end{pmatrix}.$$
A sandwich-type covariance matrix for the structured parameters, parallel to (13), is given in
Arminger and Sobel (1990), but they did not provide conditions for the result in (13) to
hold.
The condition stated in this section allows the MAR mechanism to depend on all the
linear combinations of the previously observed variables. For such a purpose, we specified
$L$ as a lower triangular matrix so that $(z_{i1}, z_{i2}, \ldots, z_{il})'$ and $(x_{i1}, x_{i2}, \ldots, x_{il})'$ determine
each other. In practice, a participant may join the study after missing a few occasions and then be
missing again. The missingness at a later stage may depend on all the previously observed
variables. We can match such a case with (12) by specifying an $L$ whose rows corresponding
to the observed variables form the upper-left part of a lower triangular matrix. Then the
result in (13) still holds.
Although, for any $\mu$ and $\Sigma$, (12) contains infinitely many nonnormal distributions for
which (13) holds under the MAR mechanism described, the result in (13) does not hold
for all nonnormal distributions. Yuan (2007) provided the following example with $p = 2$,
$\mu_1 = \mu_2 = 0$,
$$L = \begin{pmatrix} 1 & 0 & 0 \\ 0 & l_{22} & l_{23} \end{pmatrix}$$
and $z = (z_1, z_1^2, z_2)'$, where $z_1$ and $z_2$ are independent and each follows $N(0, 1)$. Suppose
$x_{i2}$ is missing when $x_{i1} = z_{i1}$ is too large; the missing data mechanism is still MAR. But
the MLEs of $\mu_2$, $\sigma_{12}$ and $\sigma_{22}$ are no longer consistent. The problem is due to the nonlinear
relationship between $x_1$ and $x_2$. In such cases, including more auxiliary variables in the first stage
of the two-stage ML may mitigate the bias.
When all missing values are MCAR, the result in (13) holds for any nonnormally dis-
tributed population (Yuan & Bentler, 2000). There is no need for the data model (12) to be
part of the condition.
The $\hat\beta$ in (13) contains both $\hat\mu$ and $\hat\sigma = \mathrm{vech}(\hat\Sigma)$ corresponding to all the variables in
$x$. When auxiliary variables exist in $x$, one may pick the subset of $\hat\beta$ corresponding to the
variables that one is interested in using for further analysis. For any analysis related to mean
comparisons, one can use
$$\sqrt N(\hat\mu_s - \mu_s) \xrightarrow{L} N(0, \Omega_{SWN\mu_s})$$
for related statistical inference, where $\hat\mu_s$ is the selected subset of $\hat\mu$ and $\hat\Omega_{SWN\mu_s}$ is the
selected submatrix of $\hat\Omega_{SWN}$. For example, one can use
$$T_\mu = N(\hat\mu_s - \mu_{s0})'\hat\Omega_{SWN\mu_s}^{-1}(\hat\mu_s - \mu_{s0}) \xrightarrow{L} \chi^2_{p_s}$$
to test the hypothesis $\mu_s = \mu_{s0}$, where $p_s$ is the number of parameters in $\mu_s$. Similarly, for
any analysis related to modeling covariance matrices, one can use
$$\sqrt N(\hat\sigma_s - \sigma_s) \xrightarrow{L} N(0, \Omega_{SWN\sigma_s}) \qquad (16)$$
to derive the properties of parameter estimates or test statistics for overall model evaluation.
When performing a simultaneous analysis of mean and covariance structures based on
selected variables, one can use
$$\sqrt N(\hat\beta_s - \beta_s) \xrightarrow{L} N(0, \Omega_{SWNs}) \qquad (17)$$
to obtain the properties of parameter estimates and test statistics for the overall model.
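Once $\hat\mu_s$ and $\hat\Omega_{SWN\mu_s}$ are extracted, the Wald statistic $T_\mu$ is a one-line computation. A small Python sketch (the numerical values in the usage line are arbitrary, for illustration only):

```python
import numpy as np

def wald_test_means(mu_s_hat, mu_s0, omega_mu_s, N):
    """Wald statistic T_mu = N (mu_s_hat - mu_s0)' Omega^{-1} (mu_s_hat - mu_s0),
    to be referred to a chi-square with len(mu_s_hat) degrees of freedom."""
    d = np.asarray(mu_s_hat, dtype=float) - np.asarray(mu_s0, dtype=float)
    return float(N * d @ np.linalg.solve(np.asarray(omega_mu_s, dtype=float), d))

# Hypothetical numbers: two selected means, identity Omega, N = 100.
T = wald_test_means([0.1, -0.05], [0.0, 0.0], np.eye(2), 100)  # T = 1.25
```

Here $T = 1.25$ falls well below the 5% critical value of $\chi^2_2$ (about 5.99), so the hypothesis $\mu_s = \mu_{s0}$ would not be rejected in this toy case.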
In the context of SEM with complete data, the rescaled statistic developed in Satorra and
Bentler (1994), the asymptotically distribution free (ADF) statistics developed in Browne
(1984) and Yuan and Bentler (1997, 1998) are all based on results parallel to (16) and
(17). In the context of exploratory factor analysis or multiple-group analysis with missing
data, nonnormal data or data with outliers, standard errors for parameter estimates and
test statistics for overall model evaluation are also based on results parallel to (16) and (17)
(Yuan & Bentler, 2001; Yuan, Marshall & Bentler, 2002). For samples from nonnormal
population distributions with missing data being MCAR, statistics for overall model evaluation
and consistent SEs have been developed in Yuan and Bentler (2000). These can be
equally developed using (17), which only requires the missing data to be MAR. Once (17) holds and
a consistent $\hat\Omega_{SWNs}$ is available, the missing data, the missing data mechanism, and the distribution
of the population are no longer relevant. Actually, most procedures for mean and covariance
structure analysis with complete data are solely based on results parallel to (17).
When there are no missing data in the sample, the MLEs of $\mu$ and $\Sigma$ are given by
the sample mean vector $\bar x$ and sample covariance matrix $S = \sum_{i=1}^{N}(x_i - \bar x)(x_i - \bar x)'/N$, or
$\hat\beta = (\bar x', \mathrm{vech}'(S))'$. To better understand the results in (13) and (17), we also obtain the
counterpart of $\hat\Omega_{SWN}$ in (13) for complete data here. For complete data, $\dot\mu_i = I_p$,
$\Sigma_i = \Sigma$, $\dot\sigma_i = I_{p^*}$, and
$$W_i = W = \frac{1}{2}D_p'(\Sigma^{-1} \otimes \Sigma^{-1})D_p,$$
where $I_{p^*}$ is the identity matrix of order $p^* = p(p+1)/2$. Let $\hat W = D_p'(S^{-1} \otimes S^{-1})D_p/2$,
$z_i = \mathrm{vech}[(x_i - \bar x)(x_i - \bar x)']$, and the sample mean vector and covariance matrix of the $z_i$ be $\bar z$
and $S_{zz}$. Then $s = \mathrm{vech}(S) = \bar z$, and the sample covariance matrix of $x_i$ with $z_i$ is
$$S_{xz} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar x)(z_i - s)'.$$
It follows from (14) and (15) that
$$\hat A_N = \begin{pmatrix} S^{-1} & 0 \\ 0 & \hat W \end{pmatrix} \quad\text{and}\quad \hat B_N = \begin{pmatrix} S^{-1} & S^{-1}S_{xz}\hat W \\ \hat W S_{xz}'S^{-1} & \hat W S_{zz}\hat W \end{pmatrix}.$$
Thus,
$$\hat\Omega_{SWN} = \begin{pmatrix} S & S_{xz} \\ S_{xz}' & S_{zz} \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N}(u_i - \bar u)(u_i - \bar u)',$$
where
$$u_i = \begin{pmatrix} x_i \\ z_i \end{pmatrix} \quad\text{and}\quad \bar u = \begin{pmatrix} \bar x \\ \bar z \end{pmatrix}.$$
So, for complete data, $\hat\Omega_{SWN}$ is just the sample covariance matrix $S_{uu}$ of the $u_i$. The matrix $S_{uu}$
is a necessary component for ADF procedures with complete data (Browne, 1984; Bentler &
Yuan, 1999; Yuan & Bentler, 1998); it is also used to obtain consistent SEs and a rescaled
statistic (Satorra & Bentler, 1994). Parallel statistics and standard errors are obtained for
missing data when $S_{uu}$ is replaced by $\hat\Omega_{SWNs}$. Thus, we will refer readers to the technical
development for complete data and illustrate the applications with missing data through an
example in the next section.
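The complete-data identity $\hat\Omega_{SWN} = S_{uu}$ is easy to verify numerically. A Python sketch (an illustration, not the paper's program) that builds $u_i = (x_i', z_i')'$ with $z_i = \mathrm{vech}[(x_i - \bar x)(x_i - \bar x)']$:

```python
import numpy as np

def omega_complete(X):
    """For complete data, Omega_SWN is the sample covariance matrix S_uu of
    u_i = (x_i', z_i')' with z_i = vech[(x_i - xbar)(x_i - xbar)']."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    R = X - xbar
    # vech: stack the columns of the lower-triangular portion
    idx = [(i, j) for j in range(p) for i in range(j, p)]
    Z = np.array([[r[i] * r[j] for i, j in idx] for r in R])
    U = np.hstack([X, Z])
    Uc = U - U.mean(axis=0)
    return Uc.T @ Uc / N   # S_uu, of order p + p(p+1)/2
```

The upper-left $p \times p$ block of the result is the (biased) sample covariance matrix $S$, consistent with the block structure displayed above.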
4. A SAS IML Program and the EQS Code
We will introduce a SAS IML program that generates $\hat\mu$ and $\hat\Sigma$ using the EM
algorithm (Dempster, Laird & Rubin, 1977), and $\hat\Omega_{SWN}$ using (13) to (15). We will also
provide EQS code for SEM using the results (16) or (17). These programs will be introduced
through the following example.
Example. Table 1.2.1 of Mardia et al. (1979) contains test scores of N = 88 students on
five subjects. The five subjects are: Mechanics, Vectors, Algebra, Analysis, and Statistics.
The first two subjects were tested with closed book exams and the last three were tested
with open book exams. Tanaka, Watadani and Moon (1991) proposed to fit the data set
by a two-factor model with the first factor representing the latent score on “closed book”
and the second factor representing the latent score on “open book”. Letting y be the vector of
Mechanics, Vectors, Analysis, and Statistics, we fit the four variables by the following factor
model
$$y = \Lambda f + e$$
with mean and covariance structures
$$m(\theta) = \Lambda\tau \quad\text{and}\quad C(\theta) = \Lambda\Phi\Lambda' + \Psi, \qquad (18)$$
where $\tau = E(f) = (\tau_1, \tau_2)'$,
$$\Lambda = \begin{pmatrix} 1.0 & \lambda_{21} & 0 & 0 \\ 0 & 0 & 1.0 & \lambda_{42} \end{pmatrix}',$$
$\Phi$ is a covariance matrix with $\phi_{11} = \mathrm{Var}(f_1)$, $\phi_{12} = \phi_{21} = \mathrm{Cov}(f_1, f_2)$, $\phi_{22} = \mathrm{Var}(f_2)$, and
$\Psi = \mathrm{diag}(\psi_{11}, \psi_{22}, \psi_{33}, \psi_{44})$. There are $q = 11$ parameters in the model, with
$$\theta = (\tau_1, \tau_2, \lambda_{21}, \lambda_{42}, \phi_{11}, \phi_{21}, \phi_{22}, \psi_{11}, \psi_{22}, \psi_{33}, \psi_{44})'.$$
With $p_s = 4$ variables, the model degrees of freedom are $p_s^* + p_s - q = 3$. The normal
distribution-based LR statistic is $T_{ML} = 3.259$, with an associated p-value of 0.353 when
referred to $\chi^2_3$, suggesting that the model in (18) fits the data very well.
We use the variable Algebra$^2$ to create missing data schemes; thus $x_3$ is an auxiliary
variable. When $x_2$ = Vectors and $x_5$ = Statistics, corresponding to the smallest 31
scores of $x_3$ = Algebra, are removed and the variable Algebra is excluded from the
analysis, the missing data mechanism is MNAR. The missing data mechanism is MAR when
the five variables are considered simultaneously. The created data set can be found at
www.nd.edu/∼kyuan/missingdata/Mardiamv25.dat, with −99 for missing values.
Appendix C contains parts of a SAS IML program that performs the first-stage analysis
of the two-stage ML. The whole program can be found at www.nd.edu/∼kyuan/missingdata
/twosML.sas. The program calculates the MLEs of $\mu$ and $\Sigma$ using all the variables, 5 in
the example. It also calculates the sandwich-type covariance matrix $\hat\Omega_{SWN}$ of $\sqrt N\hat\beta$. When
the interest is in a subset of the variables to perform the second-stage analysis, 4 variables
in the example, one only needs to specify a subset of the subscripts corresponding to these
variables in order for the program to print the corresponding subvector and submatrices of
µ, Σ, and ΩSWN . The first five statements in appendix C are for reading a raw data set.
The statement
filename data ’d:\missingdata\mardiaMV25.dat’;
tells the program where the data set is located. The numbers in the data set need to be
separated by spaces and saved in ASCII or txt format. Each missing value is coded$^3$ as
-99. The statement
input v1 v2 v3 v4 v5;
tells the program that there are five variables in the data set. These two statements need to
be modified according to the location and number of variables in a particular application.
Starting from row 6 of appendix C is the main program$^4$ of the first stage of the two-stage
ML. The statement
V_forana=1, 2, 4, 5;
$^2$The variable Algebra is chosen because $T_{ML}$ is highly significant when it is included in model (18).
$^3$The value can be easily changed in the program if -99 is a possible value for real data.
$^4$The main program is at the end of the file twosML.sas on the web.
tells the SAS program that variables 1, 2, 4, and 5 are the variables for analysis in the
second stage. Variable x3 is an auxiliary variable that is included in the estimation. But
the estimates of its mean, variance, covariances and related elements in ΩSWN are excluded
from the output. We have used vech(Σ) to stack the columns of the lower-triangular portion
of Σ. Instead of stacking the columns, some programs may stack the rows of the lower-triangular portion of Σ in the computation. For example, EQS uses the vector σEQS = (σ11, σ21, σ22, σ31, σ32, σ33, . . . , σpp)′ for covariance structure analysis and βEQS = (σ′EQS, µ′)′ for mean and covariance structure analysis; ΩEQS also needs an extra row and column of zeros in order to generate correct SEs and statistics. A permutation of the rows and columns of ΩSWNσs or ΩSWNs is needed for a proper implementation of the second-stage analysis in programs that stack the rows of the lower-triangular portion of Σ.
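To make the reordering concrete, the following sketch (an illustrative Python translation, not part of the paper's SAS program) builds the index permutation that maps the column-stacked vech(Σ) ordering onto the row-stacked ordering used by EQS.

```python
def vech_col_order(p):
    # (row, col) pairs when stacking the columns of the lower triangle:
    # s11, s21, ..., sp1, s22, s32, ...
    return [(i, j) for j in range(p) for i in range(j, p)]

def vech_row_order(p):
    # (row, col) pairs when stacking the rows of the lower triangle (EQS):
    # s11, s21, s22, s31, s32, s33, ...
    return [(i, j) for i in range(p) for j in range(i + 1)]

def permutation(p):
    col, row = vech_col_order(p), vech_row_order(p)
    # entry k gives the position in the column-stacked vector that
    # supplies element k of the row-stacked vector
    return [col.index(pair) for pair in row]

p = 3
v_col = ["s11", "s21", "s31", "s22", "s32", "s33"]  # vech(Sigma), columns stacked
v_row = [v_col[k] for k in permutation(p)]
print(v_row)
```

Applying the same index vector to both the rows and the columns of the weight matrix carries out the permutation needed for ΩSWNσs and ΩSWNs.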
The two statements
homega_swc=permuc*homega_swc*permuc‘;
homega_sw=(permu*homega_sw*permu‘||j(pvs,1,0))//(j(1,pvs,0)||1);
are to perform the needed operations on ΩSWNσs and ΩSWNs
for programs like EQS. For a
SEM program that uses vech(Σ) or (µ′, vech′(Σ))′ for analysis, these two statements are not
needed and should be removed. In summary, if using EQS for the second-stage analysis, one
only needs to modify three statements in this SAS program for any specific application. One
may need to modify five statements if another program is used.
Applying this program to the created missing data generates
#total observed patterns= 2
cases--#observed V--observed V--missing V=
57 5 1 2 3 4 5
31 3 1 3 4 2 5
hat\mu_s= 38.954545 50.58184 46.681818 40.912445
hat\Sigma_s=
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
Following this in the output is the p∗s × p∗s matrix ΩSWNσs. The next and last matrix in the
output is the (p∗s + ps + 1) × (p∗s + ps + 1) matrix ΩEQS. Because these two matrices are
relatively large, we omit them here to save space. The whole output with name TwosML.lst
can be found at the same web address as the SAS program.
The first number in the above output, 2, is the number of total observed patterns in the
original sample, call it nop. The second part contains nop rows. Each row contains p + 2
numbers regarding the missing data information for a particular pattern. The first number
is the number of cases in the pattern; the second is the number of observed variables in the
pattern, call it pi. The next pi numbers are the set of indices for the observed variables. The
last p− pi numbers are the set of indices for the missing variables.
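The pattern summary printed by the SAS program can be mimicked with a few lines of code (an illustrative Python sketch; it assumes, as in the program, that missing values are coded as -99):

```python
MISSING = -99

def patterns(data):
    """Group rows by missing-data pattern; return a list of
    (count, number observed, observed indices, missing indices),
    with 1-based variable indices as in the SAS output."""
    groups = {}
    for row in data:
        obs = tuple(j + 1 for j, v in enumerate(row) if v != MISSING)
        groups[obs] = groups.get(obs, 0) + 1
    p = len(data[0])
    result = []
    for obs, count in sorted(groups.items(), key=lambda kv: -kv[1]):
        mis = tuple(j for j in range(1, p + 1) if j not in obs)
        result.append((count, len(obs), obs, mis))
    return result

# toy data: two complete rows and one row missing variables 2 and 5
toy = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [1, MISSING, 3, 4, MISSING]]
for pat in patterns(toy):
    print(pat)
```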
For a program that generates SEs based on the sandwich-type covariance matrix (e.g.,
EQS, Mplus), one can easily get the SEs of all the parameter estimates of the saturated
model. These SEs correspond to the square roots of the diagonal elements of ΩSWN. But they alone are not enough for obtaining the whole ΩSWN, as mentioned in the introduction section.
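The relation between the sandwich-type covariance matrix and the SEs can be sketched as follows (illustrative Python with made-up 2 × 2 matrices A and B; in the special case B = A the sandwich collapses to the inverse of A, the situation that arises in appendix B):

```python
import math

def inv2(m):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# made-up Hessian-type and score-type matrices, for illustration only
A = [[2.0, 0.5], [0.5, 1.0]]
B = [[3.0, 0.4], [0.4, 2.0]]
N = 88  # sample size

Ainv = inv2(A)
omega = matmul(matmul(Ainv, B), Ainv)                 # sandwich A^{-1} B A^{-1}
ses = [math.sqrt(omega[i][i] / N) for i in range(2)]  # SEs from the diagonal
print(ses)
```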
The output from the program twosML.sas can be used to perform the second-stage analysis by any SEM software that allows the input of ΩSWNs to generate SEs based on the
sandwich-type covariance matrix and rescaled or ADF type statistics. An EQS program for
such an analysis is provided in appendix D, where the matrix ΩEQS = "ΩSWNs" is read into the program from a file⁵ named 'd:\missingdata\homega.dat'. Again, the numbers in the
file are in txt format and separated by space. The sample size is put at N = 88, same as
that for the complete data. Submitting this program to EQS 6 (Bentler, 2008) generates five numbers for each free parameter (represented by a * in the code in appendix D), as in
VECTORS =V2 =   1.290*F1 + 1.000 E2
                 .049
               26.067@
               ( .046)
               ( 28.251@
The number 1.290 is the two-stage MLE λ21, the two numbers immediately below it are
the SE and the associated z-score, using the normal distribution-based information matrix
and treating Σ as a sample covariance matrix. They should be ignored because the SE is
not consistent. The two numbers in parentheses are the SE based on the sandwich-type
⁵The file only contains the (ps + p∗s + 1) × (ps + p∗s + 1) numbers that belong to "ΩSWNs" = ΩEQS from the output of twosML.sas. The p∗s × p∗s numbers corresponding to ΩSWNσs should be saved separately for covariance structure analysis only, as in appendix E.
Table 1. (a) Parameter estimates θ̂, their SEs and z-scores

              complete data               missing data
 θ         θ̂        SE       z         θ̂        SE       z
 τ1     39.263    1.729   22.707    39.187    1.748   22.416
 τ2     46.634    1.588   29.374    46.660    1.585   29.435
 λ21     1.288    0.049   26.531     1.290    0.046   28.251
 λ42     0.910    0.031   28.919     0.882    0.026   33.681
 φ11    86.752   22.266    3.896   102.040   33.993    3.002
 φ21    78.498   16.598    4.729    81.852   23.668    3.458
 φ22   164.396   28.854    5.698   200.402   54.635    3.668
 ψ11   191.417   26.350    7.264   182.097   29.874    6.095
 ψ22    30.989   20.381    1.520    30.703   25.392    1.209
 ψ33    57.904   24.804    2.334    19.499   51.742    0.377
 ψ44   147.017   23.294    6.311   199.950   38.821    5.151
(b) Statistics for overall model evaluation

         complete data             missing data
       TRML   TCRADF   TRF      TRML   TCRADF   TRF
 T     3.055   2.461  0.825     1.361   1.285  0.425
 p     0.383   0.482  0.484     0.715   0.733  0.736
covariance matrix and the associated z-score. These are consistent and should be used when
inferring the significance of λ21. Table 1(a) contains θ, the consistent SEs and the associated
z-scores for both the complete data and missing data. The SEs and z-scores for complete data
are also based on the sandwich-type covariance matrix to facilitate the comparison. Most of
the parameter estimates under missing data are comparable to those under complete data, due to their consistency. Some, however, show substantial differences. For example, ψ33 under complete data is statistically significant at the .05 level as judged by its z-score, but it is not significant under missing data. The smaller z-score is due to a smaller estimate together with a greater SE. Actually, most of the SEs under missing data are considerably greater than those under complete data, due to the loss of information caused by missing data. Note that, although the variable y3 = Analysis does not contain missing values, ψ33 is still strongly affected by the missing data in the other variables.
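As a quick arithmetic check on Table 1(a) (illustrative Python), a z-score is just the estimate divided by its SE; for ψ33 this reproduces the reported values 2.334 and 0.377:

```python
# (estimate, SE) pairs for psi_33, copied from Table 1(a)
complete = (57.904, 24.804)
missing = (19.499, 51.742)

for label, (est, se) in [("complete", complete), ("missing", missing)]:
    print(label, round(est / se, 3))  # z-score = estimate / SE
```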
Submitting this program to EQS 6.1 also generates six statistics for the overall model
evaluation. The three that we recommend are the rescaled statistic TRML, the residual-
based corrected ADF statistic TCRADF , and the residual-based F -statistic TRF . These appear
in EQS output respectively as
SATORRA-BENTLER SCALED CHI-SQUARE = 1.3605 ON 3 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS .71482

YUAN-BENTLER RESIDUAL-BASED TEST STATISTIC = 1.285
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS .73266

YUAN-BENTLER RESIDUAL-BASED F-STATISTIC = .425
DEGREES OF FREEDOM = 3, 85
PROBABILITY VALUE FOR THE F-STATISTIC IS .73579
The statistic TRML does not approach the nominal chi-square distribution in general; instead, it approaches the distribution of a linear combination of chi-square variates that has the same expected value as the nominal chi-square. Many simulation results with
complete data indicate that TRML performs quite well at finite samples (e.g., Hu, Bentler
& Kano, 1992). The statistic TCRADF asymptotically follows a chi-square distribution and
performs quite well with complete data at finite sample sizes (Bentler & Yuan, 1999; Yuan
& Bentler, 1998). The statistic TRF also possesses the property of ADF and performs well
at small sample sizes with complete data (Bentler & Yuan, 1999; Yuan & Bentler, 1998).
We expect these statistics will perform similarly with missing data for two reasons: (1) the
asymptotic properties of the three statistics also hold for the two-stage ML with missing data,
that is, TRML approaches a distribution with the same expected value as that of the nominal
chi-square; TCRADF and TRF are asymptotically distribution free; (2) as the percentage of
missing values approaches zero the three statistics automatically become their complete-data
counterparts.
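The claim that the limiting distribution is a linear combination of chi-squares can be illustrated by simulation (Python sketch with made-up weights): the sum has mean equal to the sum of the weights, so dividing by the average weight, analogous to what the rescaling does with estimated quantities, restores the mean of the nominal chi-square.

```python
import random

random.seed(1)
weights = [2.0, 1.0, 0.5]   # made-up eigenvalue-type weights, one per df
df = len(weights)
reps = 20000

total = 0.0
for _ in range(reps):
    # one draw from sum_i lambda_i * chi^2_1
    total += sum(w * random.gauss(0.0, 1.0) ** 2 for w in weights)

mean_t = total / reps
c = sum(weights) / df        # average weight, the scaling constant
print(round(mean_t, 2), round(mean_t / c, 2))
```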
The three statistics for complete data and missing data are reported in Table 1(b). Below
each statistic is the associated p-value, obtained by referring TRML or TCRADF to χ²3 and TRF to F3,85, the F-distribution with p∗s + ps − q and N − (p∗s + ps − q) degrees of freedom. Like the statistic
TML for complete data, the three statistics in Table 1(b) also suggest that the model in (18)
fits the data well. None of the statistics in Table 1(b) under complete data assumes a known population distribution; their p-values are slightly greater than the one corresponding to TML.
Under missing data, the statistics in Table 1(b) also suggest the model fits the sample very
well. Actually, their p-values are greater than those under the complete data.
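The p-values in Table 1(b) can be recomputed from the reported statistics; for the 3 degrees of freedom here, the chi-square survival function has a closed form (Python sketch using only the standard library):

```python
import math

def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 df:
    P(X > x) = 2[1 - Phi(sqrt(x))] + sqrt(2x/pi) exp(-x/2)."""
    s = math.sqrt(x)
    normal_cdf = 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))
    return 2.0 * (1.0 - normal_cdf) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# rescaled statistic under missing data, from the EQS output
print(round(chi2_sf_df3(1.3605), 3))
```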
Although the numbers reported in Table 1 under missing data are all asymptotically
valid, some differ sizeably from those for complete data, due to sampling errors at finite sample sizes.
Appendix E contains the EQS code for covariance structure analysis only. The matrix ΩSWNσs is read in from an external file. Because the structure of the code and the output
are similar to that in appendix D, we will not discuss the details. Interested readers are
referred to Bentler (2008).
In practice, a data set may contain variables xj and xk that have never been simultaneously observed for any participant. Then the parameter σjk is not estimable and the AN
in (13) is literally singular. In such a case, the SAS program will be unable to generate the
desired Σs or ΩSWNs.
5. Discussion and Conclusion
In social and behavioral sciences, data sets from normal distributions are rare (Micceri,
1989). For a sample with missing values that are MCAR or MAR, if its population distribution is known, the ML based on the true distribution is the preferred method of analysis. In practice, however, we rarely know the population distribution.
In such a situation, most researchers will choose the ML based on the normal distribution
assumption for analysis. The aim of the paper is to redirect such a practice to a statistically
sound two-stage ML. Under the condition described in section 3, the normal distribution-
based MLEs of the means and covariances are still consistent. The asymptotic covariance
matrix of the MLEs can be consistently estimated. At the end of the first stage, the problem
of mean and covariance structure analysis with missing data becomes the same as that for
complete data.
Although it has been suggested that one should include as many auxiliary variables as
possible, including too many may create problems if they are collinear among themselves or with the substantive variables. Then the matrix AN will be near singular and ΩSWN will be inflated.
If possible, one should selectively choose auxiliary variables that are most relevant or closely
related to those that are going to be analyzed at stage-2.
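A toy numeric illustration (Python, made-up numbers) of this inflation: as the off-diagonal of a 2 × 2 A approaches 1, A becomes near singular and the diagonal of the sandwich A⁻¹BA⁻¹ blows up.

```python
def sandwich_diag(r):
    """Diagonal of A^{-1} B A^{-1} with A = [[1, r], [r, 1]] and B = I."""
    det = 1.0 - r * r
    a, b = 1.0 / det, -r / det          # A^{-1} = [[a, b], [b, a]]
    # with B = I the sandwich reduces to A^{-1} A^{-1}
    return a * a + b * b, a * a + b * b

for r in (0.2, 0.9, 0.99):
    print(r, round(sandwich_diag(r)[0], 1))
```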
Enders and Peugh (2004) studied a two-stage ML for missing data. They only focused
on adjusting the sample size required at stage-2, not paying attention to alternative SEs
or test statistics. None of their suggested procedures leads to valid inference even when
data are normally distributed. Savalei and Bentler (2007) studied the two-stage ML with
missing data and a normally distributed population, using ΩSWN = AN^{-1}. They found that the
rescaled statistic and a residual-based generalized least squares (GLS) statistic performed
well. Because the GLS statistic is justified under the normal distribution assumption, its good performance may not carry over when the population distribution is not normal. With a nonnormal population distribution, the sandwich-type covariance
matrix given in (16) or (17) has to be used for consistent SEs and valid statistics for overall
model evaluation.
We want to note that the MAR mechanism described in section 3 is through selections
based on the linear combinations of the observed variables falling into certain intervals.
An MAR mechanism can also be created by other selection processes (e.g., Schafer, 1997, p. 25). Although we suspect that the result in (13) also holds for other selection processes that create an MAR mechanism, this needs to be proved or empirically studied before claiming that
(13) holds for all MAR schemes. We also want to note that model (12) does not account
for outliers or data contamination. Any normal distribution-based procedures are no longer
reliable when data are contaminated, and the effect of data contamination can be much
worse than missing data with MNAR mechanism. When data are contaminated or contain
outliers, one may estimate µ and Σ using the multivariate t-distribution in the first stage
(Little, 1988). With (13) and a consistent ΩSWN , the second-stage analysis is the same as
that using the normal distribution assumption given in this paper.
The literature on mean and covariance structure analysis with missing data often emphasizes the merit of the direct ML or the so-called full information ML. As shown in Yuan and Bentler (2000), the direct ML does not enjoy any better properties than the two-stage ML unless the population is normally distributed. For nonnormally distributed populations,
SEs from the direct ML are not consistent; the LR statistic does not approach a chi-square
distribution. With auxiliary variables, Graham (2003) proposed to let auxiliary variables be
correlated with all measurement errors in the direct ML. But it is not clear how this will
affect the structural parameter estimates; it is not convenient to implement it either (Savalei
& Bentler, 2007).
Multiple imputation (MI) has been recommended for missing data when the distribution
of the population can be specified (Schafer & Graham, 2002). If the distribution of the
population is not multivariate normal and the imputed values are generated from a normal
distribution, then each sample contains a mixture of values from two distributions. One is then unlikely to obtain valid inference when submitting the samples to a SEM program for complete data. For samples from an unknown population distribution, Enders (2002, 2005) gives a
bootstrap procedure to test the overall model fit. Unfortunately, the transformation in equa-
tion (4) of Enders (2002) or equation (2) of Enders (2005) does not satisfy the requirement
for bootstrap testing of the null hypothesis. When the sample size is not large enough,
the bootstrap by resampling from the original observations may provide more accurate SEs
than those based on asymptotics. But the bootstrap may suffer from nonconvergence, especially with missing data. Then the SEs based on just the converged samples will not be
reliable. So, if not explicitly modeling the missing data mechanism, the procedure developed
in this paper might be the best for SEM with missing data in practice where the population
distribution is typically unknown.
Although the asymptotic properties of the statistics TRML, TCRADF and TRF also hold for
the two-stage ML with missing data, and there is no reason for them to behave differently
with missing data, further study about their finite sample behavior is still valuable (e.g., Savalei & Bentler, 2007), especially when the percentage of missing data is large. For statistics
that do not perform well with complete data, it is hard to imagine that they will perform
well with missing data. Any such study should focus on statistics that perform well with
complete data (Hu, et al., 1992; Bentler & Yuan, 1999).
Acknowledgement: We are thankful to the editor and two referees for their constructive
comments that have led to a significant improvement of the paper over the previous version.
Appendix A

In this appendix we will show that $\sum_{j=1}^{n} g(z_{j1}, z_{j2})$ and $\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})$ have the same distribution. Because a distribution and its characteristic function are uniquely determined by each other, we only need to show that they have identical characteristic functions.

Let the characteristic function of $g(z_{j1*}, z_{j2*})$ be $\varphi_*(t)$. Then, following the definition, the characteristic function of $\sum_{j=1}^{n} g(z_{j1}, z_{j2})$ is
$$
E\{\exp[it\sum_{j=1}^{n} g(z_{j1}, z_{j2})]\}
= E\{E\{\exp[it\sum_{j=1}^{n} g(z_{j1}, z_{j2})]\mid n\}\}
= E[\varphi_*^n(t)]
= \sum_{j=0}^{N} \binom{N}{j}\varphi_*^j(t)F_1^j(c)[1-F_1(c)]^{N-j}
= [1 + F_1(c)\varphi_*(t) - F_1(c)]^N.
$$
The characteristic function of $\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})$ is
$$
E\{\exp[it\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})]\}
= (E\{\exp[it w_1 g(z_{11*}, z_{12*})]\})^N
= (E[E\{\exp[it w_1 g(z_{11*}, z_{12*})]\}\mid w_1])^N
= [1 + F_1(c)E\{\exp[it g(z_{11*}, z_{12*})]\} - F_1(c)]^N
= [1 + F_1(c)\varphi_*(t) - F_1(c)]^N.
$$
Thus, the two characteristic functions are the same.

The proof that $\sum_{j=n+1}^{N} g(z_{j1}, z_{j2})$ and $\sum_{j=1}^{N}(1-w_j) g(z_{j1}^*, z_{j2}^*)$ have the same distribution is essentially the same, after noticing that $u_j = 1 - w_j$ also follows a Bernoulli distribution.
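The equal-distribution result can also be checked by simulation (an illustrative Python sketch with g(z1, z2) = z1, c = 0 and standard normal z1, so that F1(c) = 0.5; truncated draws are generated by rejection):

```python
import random

random.seed(2)
N, reps, c = 10, 20000, 0.0

def sum_selected():
    # sum of z1 over the observations with z1 <= c among N iid N(0,1) draws
    return sum(z for z in (random.gauss(0.0, 1.0) for _ in range(N)) if z <= c)

def sum_weighted():
    # sum of w_j * z1j*, with w_j ~ Bernoulli(F1(c)) independent of z1j*
    total = 0.0
    for _ in range(N):
        if random.random() < 0.5:        # F1(0) = 0.5
            z = random.gauss(0.0, 1.0)
            while z > c:                 # rejection: z1* is z1 given z1 <= c
                z = random.gauss(0.0, 1.0)
            total += z
    return total

m1 = sum(sum_selected() for _ in range(reps)) / reps
m2 = sum(sum_weighted() for _ in range(reps)) / reps
print(round(m1, 2), round(m2, 2))   # both near N*F1(0)*E(z1 | z1 <= 0)
```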
Appendix B
This appendix provides the details for the consistency of the sandwich-type covariance matrix $\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}$. We will show that $\hat A_N$ converges to a matrix $A$, $\hat B_N$ converges to a matrix $B$, and the $\Omega$ in (10) equals $A^{-1}BA^{-1}$. Because $\hat A_N$ and $\hat B_N$ are defined by the derivatives of the log likelihood function, we need to obtain $\dot l_i(\mu)$ and $\ddot l_i(\mu)$ analytically. It follows from (2) that
$$
\dot l_i(\mu) = \Sigma^{-1}(x_i - \mu), \quad -\ddot l_i(\mu) = \Sigma^{-1}, \quad i = 1, 2, \ldots, n;
$$
$$
\dot l_i(\mu) = \begin{pmatrix} (x_{i1} - \mu_1)/\sigma_{11} \\ 0 \end{pmatrix}, \quad
-\ddot l_i(\mu) = \begin{pmatrix} 1/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}, \quad i = n+1, n+2, \ldots, N. \quad (A1)
$$
According to (11),
$$
A_N = \frac{n}{N}\Sigma^{-1} + \begin{pmatrix} (N-n)/(N\sigma_{11}) & 0 \\ 0 & 0 \end{pmatrix}.
$$
Because $A_N$ does not involve $\hat\mu$, $\hat A_N = A_N$. Let
$$
A = F_1(c)\Sigma^{-1} + \begin{pmatrix} [1 - F_1(c)]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A2)
$$
It follows from $n = \sum_{i=1}^{N} w_i \sim B(N, F_1(c))$ that $n/N \xrightarrow{P} F_1(c)$. Thus, $\hat A_N \xrightarrow{P} A$.
Before turning to the probability limit of $\hat B_N$, we need to note that the moments of the random variables $z_{1*}$ and $z_1^*$ introduced in section 2 are closely related to those of $z_1$. Actually, it follows from the definition of conditional probability $P(z_1 \le t \mid w = 1) = P(z_1 \le t \mid z_1 \le c)$ that the CDF of $z_{1*}$ is
$$
F_{1*}(t) = \begin{cases} F_1(t)/F_1(c), & t \le c, \\ 1, & t > c. \end{cases} \quad (A3)
$$
Similarly, the CDF of $z_1^*$ is
$$
F_1^*(t) = \begin{cases} 0, & t \le c, \\ [F_1(t) - F_1(c)]/[1 - F_1(c)], & t > c. \end{cases} \quad (A4)
$$
It follows from (A3) and (A4) that $E(z_{1*})$, $E(z_1^*)$, $E(z_{1*}^2)$ and $E(z_1^{*2})$ are all well defined, and
$$
F_1(c)E(z_{1*}^2) + [1 - F_1(c)]E(z_1^{*2}) = E(z_1^2) = 1. \quad (A5)
$$
Now we are ready to consider the convergence of $\hat B_N$. It follows from (11) and (A1) that
$$
B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\mu)(x_i-\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix} \quad (A6)
$$
and
$$
\hat B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\hat\mu)(x_i-\hat\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}. \quad (A7)
$$
Let
$$
B = F_1(c)L'^{-1}\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}L^{-1}
+ \frac{1-F_1(c)}{\sigma_{11}}\begin{pmatrix} E(z_1^{*2}) & 0 \\ 0 & 0 \end{pmatrix}. \quad (A8)
$$
In the following we will first show that $B_N \xrightarrow{P} B$; next we will show that $\hat B_N$ has the same probability limit as $B_N$ by showing that $\hat B_N - B_N \xrightarrow{P} 0$; then we will see that the matrix $A^{-1}BA^{-1}$ is just the $\Omega$ given in (10).

We will need to relate the $x_i$ and $x_{i1}$ in (A6) to $z_{i1}$ and $z_{i2}$ through (4) to facilitate obtaining the probability limit of $B_N$. Using $x_i - \mu = Lz_i$, $i = 1, 2, \ldots, n$, and $x_{i1} - \mu_1 = \sigma_1 z_{i1}$, $i = n+1, n+2, \ldots, N$, we can rewrite (A6) as
$$
B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}Lz_iz_i'L'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} z_{i1}^2/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}
$$
$$
= \Sigma^{-1}L\left[\frac{1}{N}\sum_{i=1}^{N}w_i
\begin{pmatrix} z_{i1*}^2 & z_{i1*}z_{i2*} \\ z_{i1*}z_{i2*} & z_{i2*}^2 \end{pmatrix}\right]L'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\begin{pmatrix} z_{i1}^{*2}/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A9)
$$
Recall that $\Sigma^{-1} = L'^{-1}L^{-1}$, $E(z_{i2*}) = 0$, $E(z_{i2*}^2) = 1$; $z_{i1*}$, $z_{i2*}$ and $w_i$ are independent; and $z_{i1}^*$ and $w_i$ are independent. Applying the law of large numbers to (A9) yields
$$
B_N \xrightarrow{P} B. \quad (A10)
$$
Turning to $\hat B_N$, we can rewrite (A7) as
$$
\hat B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\mu+\mu-\hat\mu)(x_i-\mu+\mu-\hat\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1+\mu_1-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}
$$
$$
= B_N + B_{N12}^{(1)} + B_{N21}^{(1)} + B_{N22}^{(1)} + 2B_{N12}^{(2)} + B_{N22}^{(2)}, \quad (A11)
$$
where
$$
B_{N12}^{(1)} = \Sigma^{-1}\left[\frac{1}{N}\sum_{i=1}^{n}(x_i-\mu)\right](\mu-\hat\mu)'\Sigma^{-1}, \quad
B_{N21}^{(1)} = B_{N12}^{(1)\prime},
$$
$$
B_{N22}^{(1)} = \left(\frac{1}{N}\sum_{i=1}^{n}1\right)\Sigma^{-1}(\mu-\hat\mu)(\mu-\hat\mu)'\Sigma^{-1},
$$
$$
B_{N12}^{(2)} = \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1)(\mu_1-\hat\mu_1)/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix},
$$
$$
B_{N22}^{(2)} = \left(\frac{1}{N}\sum_{i=n+1}^{N}1\right)\begin{pmatrix} (\mu_1-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}.
$$
We will show that $B_{N12}^{(1)}$, $B_{N21}^{(1)}$, $B_{N22}^{(1)}$, $B_{N12}^{(2)}$ and $B_{N22}^{(2)}$ all approach zero in probability. Notice that $\hat\mu \xrightarrow{P} \mu$ was obtained in section 2. It follows from
$$
\frac{1}{N}\sum_{i=1}^{n}(x_i-\mu) = \frac{1}{N}\sum_{i=1}^{N}w_iLz_{i*} \xrightarrow{P} F_1(c)LE(z_*)
$$
that
$$
B_{N12}^{(1)} = B_{N21}^{(1)\prime} \xrightarrow{P} 0; \quad (A12)
$$
it follows from
$$
\frac{1}{N}\sum_{i=1}^{n}1 = \frac{1}{N}\sum_{i=1}^{N}w_i \xrightarrow{P} F_1(c)
$$
that
$$
B_{N22}^{(1)} \xrightarrow{P} 0; \quad (A13)
$$
it follows from
$$
\frac{1}{N}\sum_{i=n+1}^{N}(x_{i1}-\mu_1) = \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\sigma_1z_{i1}^* \xrightarrow{P} [1-F_1(c)]\sigma_1E(z_1^*)
$$
and
$$
\frac{1}{N}\sum_{i=n+1}^{N}1 = \frac{1}{N}\sum_{i=1}^{N}(1-w_i) \xrightarrow{P} 1-F_1(c)
$$
that
$$
B_{N12}^{(2)} \xrightarrow{P} 0 \quad\text{and}\quad B_{N22}^{(2)} \xrightarrow{P} 0. \quad (A14)
$$
Combining (A10) to (A14) yields
$$
\hat B_N \xrightarrow{P} B.
$$
We still need to show that $A^{-1}BA^{-1}$ equals the $\Omega$ in (10) for the consistency of $\hat\Omega_{SWN}$. We will achieve this by working with $B$ and showing that $B = A$ first; then we will show that $\Omega = A^{-1}$. Let $d = |L| = \sigma_1\sigma_2(1-\rho^2)^{1/2}$; then
$$
L^{-1} = d^{-1}\begin{pmatrix} \sigma_2(1-\rho^2)^{1/2} & 0 \\ -\sigma_2\rho & \sigma_1 \end{pmatrix}.
$$
It follows from
$$
\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
+ \begin{pmatrix} E(z_{1*}^2)-1 & 0 \\ 0 & 0 \end{pmatrix}
$$
that
$$
L'^{-1}\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}L^{-1}
= \Sigma^{-1} + \begin{pmatrix} [E(z_{1*}^2)-1]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A15)
$$
Using (A5) and combining (A8) and (A15) yield
$$
B = F_1(c)\Sigma^{-1} + \begin{pmatrix} [1-F_1(c)]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}.
$$
Thus, $B = A$ and
$$
\hat\Omega_{SWN} \xrightarrow{P} \Omega_{SW} = A^{-1}.
$$
We only need to show $A^{-1} = \Omega$ in order for $\hat\Omega_{SWN} \xrightarrow{P} \Omega$. It follows from
$$
\Sigma^{-1} = \frac{1}{d^2}\begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix}
$$
that we can write the $A$ in (A2) as
$$
A = \begin{pmatrix}
F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11} & -F_1(c)\sigma_{12}/d^2 \\
-F_1(c)\sigma_{12}/d^2 & F_1(c)\sigma_{11}/d^2
\end{pmatrix}.
$$
Let $h^2 = |A|$; then
$$
h^2 = \{F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11}\}F_1(c)\sigma_{11}/d^2 - F_1^2(c)\sigma_{12}^2/d^4
= \frac{F_1^2(c)}{d^4}(\sigma_{11}\sigma_{22}-\sigma_{12}^2) + \frac{[1-F_1(c)]F_1(c)}{d^2}
= \frac{F_1^2(c) + F_1(c)[1-F_1(c)]}{d^2}
= \frac{F_1(c)}{d^2}.
$$
Consequently,
$$
A^{-1} = \frac{1}{h^2}\begin{pmatrix}
F_1(c)\sigma_{11}/d^2 & F_1(c)\sigma_{12}/d^2 \\
F_1(c)\sigma_{12}/d^2 & F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11}
\end{pmatrix}
= \begin{pmatrix}
\sigma_{11} & \sigma_{12} \\
\sigma_{12} & \sigma_{22}\{1 + [1/F_1(c)-1](1-\rho^2)\}
\end{pmatrix}. \quad (A16)
$$
Comparing (A16) with (10), we have $A^{-1} = \Omega$. Thus, $\hat\Omega_{SWN}$ is consistent for $\Omega$.
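The closed form (A16) can be confirmed numerically by inverting A directly (illustrative Python with arbitrary made-up values of σ1, σ2, ρ and F1(c)):

```python
def a_inverse(s1, s2, rho, F):
    """Invert A = F * Sigma^{-1} + diag((1 - F)/sigma11, 0) for the
    bivariate case treated in this appendix."""
    s11, s22, s12 = s1 * s1, s2 * s2, rho * s1 * s2
    d2 = s11 * s22 - s12 * s12          # d^2 = |Sigma|
    a11 = F * s22 / d2 + (1.0 - F) / s11
    a12 = -F * s12 / d2
    a22 = F * s11 / d2
    det = a11 * a22 - a12 * a12         # this is h^2 = F1(c)/d^2
    return [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]

s1, s2, rho, F = 1.3, 0.7, 0.4, 0.6     # arbitrary made-up values
inv = a_inverse(s1, s2, rho, F)
# closed form (A16): sigma11, sigma12, and an inflated sigma22
target22 = s2 * s2 * (1.0 + (1.0 / F - 1.0) * (1.0 - rho * rho))
print(inv[0][0], inv[0][1], inv[1][1], target22)
```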
Appendix C
data raw;
filename data ’d:\missingdata\mardiaMV25.dat’; *need to be modified;
infile data;
input v1 v2 v3 v4 v5; *need to be modified;
run;
*-------------------------------------------------------------*;
use raw;
read all var _num_ into x;
close raw;
n=nrow(x);
p=ncol(x);
V_forana={1, 2, 4, 5}; *need to be specified;
p_v=nrow(V_forana);
pvs=p_v+p_v*(p_v+1)/2;
run pattern(n,p,x,misinfo);
totpat=nrow(misinfo);
print "#total observed patterns=" totpat;
print "cases--#observed V--observed V--missing V=";
print misinfo;
run emmus(n,p,x,misinfo,hmu1,hsigma1);
hmu_s=hmu1[V_forana]‘;
hsigma_s=hsigma1[V_forana,V_forana];
print "hat\mu_s=";
print hmu_s;
print "hat\Sigma_s=";
print hsigma_s;
run Omega(n,p,hmu1, hsigma1,x,misinfo, omega_sw);
run indexv(p,V_forana,index_s); *index for both means and covariances;
run indexvc(p,V_forana,index_sc); *index for only the covariances;
run switch(p_v, permuc, permu); *generating permutation matrices;
homega_swc=omega_sw[index_sc,index_sc];
homega_swc=permuc*homega_swc*permuc‘;*needed for the 2nd stage ML in EQS;
print "hat\Omega_sw\sigma_s=";
print homega_swc;
homega_sw=omega_sw[index_s,index_s];
homega_sw=( permu*homega_sw*permu‘||j(pvs,1,0) )//(j(1,pvs,0)||1);
*needed for the 2nd stage ML in EQS;
print "hat\Omega_sw_s=";
print homega_sw;
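The index bookkeeping performed by the indexvc module above can be mimicked as follows (an illustrative Python sketch, not the paper's IML code): given p variables and the subset for the second-stage analysis, it returns the positions of vech(Σ) that belong to the subset's submatrix.

```python
def vech_indices(p, subset):
    """0-based positions in vech(Sigma) (columns of the lower triangle
    stacked) whose (row, col) pair lies entirely within `subset`
    (1-based variable numbers, as in V_forana)."""
    keep = set(subset)
    pairs = [(i + 1, j + 1) for j in range(p) for i in range(j, p)]
    return [k for k, (i, j) in enumerate(pairs) if i in keep and j in keep]

# variables 1, 2, 4, 5 out of p = 5, as in the example
idx = vech_indices(5, [1, 2, 4, 5])
print(idx, len(idx))
```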
Appendix D
/TITLE
EQS 6.1: Mean and covariance structure analysis
using the output of twosML.sas
/SPECIFICATION
weight=’d:\missingdata\homega.dat’;
cases=88; variables=4; matrix=covariance;
analysis=moment; methods=ML, robust;
/LABELS
V1=Mechanics; V2=Vectors; V3=Analysis; V4=Statistics;
/EQUATIONS
V1= F1+E1;
V2= *F1+E2;
V3= F2+E3;
V4= *F2+E4;
F1= *V999+D1;
F2= *V999+D2;
/VARIANCES
E1-E4= *;
D1=*;
D2=*;
/COVARIANCES
D1,D2= *;
/TECHNICAL
conv=0.00000001;
/Means
38.954545 50.58184 46.681818 40.912445
/MATRIX
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
/END
Appendix E
/TITLE
EQS 6.1: Covariance structure analysis using the output of twosML.sas
/SPECIFICATION
weight=’d:\missingdata\homegac.dat’;
cases=88; variables=4; matrix=covariance;
analysis=covariance; methods=ML, robust;
/LABELS
V1=Mechanics; V2=Vectors; V3=Analysis; V4=Statistics;
/EQUATIONS
V1= F1+E1;
V2= *F1+E2;
V3= F2+E3;
V4= *F2+E4;
/VARIANCES
E1-E4= *;
F1=*;
F2=*;
/COVARIANCES
F1,F2= *;
/TECHNICAL
conv=0.00000001;
/MATRIX
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
/END
References
Amemiya, T. (1973). Regression analysis when the dependent variable is truncated normal.
Econometrica, 41, 997–1016.
Anderson, T. W. (1957). Maximum likelihood estimates for the multivariate normal distribu-
tion when some observations are missing. Journal of the American Statistical Association,
52, 200–203.
Arminger, G., & Sobel, M. E. (1990). Pseudo-maximum likelihood estimation of mean and
covariance structures with missing data. Journal of the American Statistical Association,
85, 195–203.
Bentler, P. M. (2008). EQS 6 structural equations program manual. Encino, CA: Multivariate
Software.
Bentler, P. M., & Yuan, K.-H. (1999). Structural equation modeling with small samples:
Test statistics. Multivariate Behavioral Research, 34, 181–197.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Browne, M. W. (1984). Asymptotic distribution-free methods for the analysis of covariance
structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
Collins, L. M., Schafer, J. L., & Kam, C. K. (2001). A comparison of inclusive and restrictive
strategies in modern missing-data procedures. Psychological Methods, 6, 330–351.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from
incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical
Society B, 39, 1–38.
Enders, C. K. (2002). Applying the Bollen-Stine bootstrap for goodness-of-fit measures
to structural equation models with missing data. Multivariate Behavioral Research, 37,
359–377.
Enders, C. K. (2005). A SAS macro for implementing the modified Bollen-Stine bootstrap
for missing data: Implementing the bootstrap using existing structural equation modeling
software. Structural Equation Modeling, 12, 620–641.
Enders, C. K., & Peugh, J. L. (2004). Using an EM covariance matrix to estimate structural
equation models with missing data: Choosing an adjusted sample size to improve the
accuracy of inferences. Structural Equation Modeling, 11, 1–19.
Ferguson, T. S. (1996). A course in large sample theory. London: Chapman & Hall.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods:
Theory. Econometrica, 52, 681–700.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural
equation models. Structural Equation Modeling, 10, 80–100.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47,
153–161.
Hoffman, P. J. (1959). Generating variables with arbitrary properties. Psychometrika, 24,
265–267.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis
be trusted? Psychological Bulletin, 112, 351–362.
Jamshidian, M., & Bentler, P. M. (1999). Using complete data routines for ML estimation of
mean and covariance structures with missing data. Journal of Educational and Behavioral
Statistics, 23, 21–41.
Laird, N. M. (1988). Missing data in longitudinal studies. Statistics in Medicine, 7, 305–315.
Lee, W.-C., & Rodgers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Little, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data
with missing values. Applied Statistics, 37, 23–38.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New
York: Wiley.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications.
Biometrika, 57, 519–530.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. New York:
Academic Press.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psy-
chological Bulletin, 105, 156–166.
Rotnitzky, A., & Wypij, D. (1994). A note on the bias of estimators with missing data.
Biometrics, 50, 1163–1170.
Rubin, D. B. (1976). Inference and missing data (with discussions). Biometrika, 63, 581–592.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in
covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent Variables
Analysis: Applications for Developmental Research (pp. 399–419). Newbury Park, CA:
Sage.
Savalei, V., & Bentler, P. M. (2007). A two-stage ML approach to missing data: Theory
and application to auxiliary variables. UCLA Statistics Electronic Publications #511.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7, 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data
problems: A data analyst’s perspective. Multivariate Behavioral Research, 33, 545–571.
Tanaka, Y., Watadani, S., & Moon, S. H. (1991). Influence in covariance structure analysis:
With an application to confirmatory factor analysis. Communications in Statistics - Theory and Methods, 20, 3805–3821.
Tobin, J. (1958). Estimation for relationships with limited dependent variables. Economet-
rica, 26, 24–36.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
50, 1–25.
Yuan, K.-H. (2007). Normal theory ML for missing data with violation of distribution
assumptions. Under review.
Yuan, K.-H., & Bentler, P. M. (1997). Mean and covariance structure analysis: Theoretical
and practical improvements. Journal of the American Statistical Association, 92, 767–774.
Yuan, K.-H., & Bentler, P. M. (1998). Normal theory based test statistics in structural
equation modeling. British Journal of Mathematical and Statistical Psychology, 51, 289–
309.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and co-
variance structure analysis with nonnormal missing data. In M. E. Sobel & M. P. Becker
(Eds.), Sociological methodology 2000 (pp. 167–202). Oxford: Blackwell.
Yuan, K.-H., & Bentler, P. M. (2001). A unified approach to multigroup structural equation
modeling with nonstandard samples. In G. A. Marcoulides & R. E. Schumacker (Eds.),
Advanced structural equation modeling: New developments and techniques (pp. 35–56).
Mahwah, NJ: Lawrence Erlbaum Associates.
Yuan, K.-H., Marshall, L. L., & Bentler, P. M. (2002). A unified approach to exploratory
factor analysis with missing data, nonnormal data, and in the presence of outliers. Psy-
chometrika, 67, 95–122.