SEM with Missing Data and Unknown Population Distributions Using Two-stage ML:
Theory and Its Application∗
Ke-Hai Yuan and Laura Lu
University of Notre Dame
January 14, 2008
Revised July 24, 2008
Second revision, August 7, 2008
∗The research was supported by NSF grant DMS04-37167 and Grants DA00017 and DA01070
from the National Institute on Drug Abuse. Correspondence concerning this article: Ke-Hai
Yuan, Department of Psychology, University of Notre Dame, Notre Dame, IN 46556, USA
Abstract
This paper provides the theory and application of the two-stage ML procedure for struc-
tural equation modeling (SEM) with missing data. The validity of this procedure does not
require the assumption of a normally distributed population. When the population is nor-
mally distributed and all missing data are missing at random (MAR), the direct maximum
likelihood (ML) procedure is nearly optimal for SEM with missing data. When missing data
mechanisms are unknown, including auxiliary variables in the analysis will make the missing
data mechanism more likely to be MAR. It is much easier to include auxiliary variables in
the two-stage ML than in the direct ML. Based on the most recent developments for missing
data with an unknown population distribution, the paper first provides a minimally technical
account of why the normal distribution-based ML generates consistent parameter estimates
when the missing data mechanism is MAR. The paper also provides sufficient conditions for
the two-stage ML to be a valid statistical procedure in the general case. For the application
of the two-stage ML, a SAS IML program is given to perform the first-stage analysis and
EQS codes are provided to perform the second-stage analysis. An example with open/closed
book examination data is used to illustrate the application of the provided programs. One
aim is for quantitative graduate students/applied psychometricians to understand the tech-
nical details for missing data analysis. Another aim is for applied researchers to use the
method properly.
Keywords: Asymptotic normality, auxiliary variables, consistency, missing at random, sandwich-
type covariance matrix.
1. Introduction
In the social and behavioral sciences, data are typically collected using surveys or
questionnaires. Missing data cannot be avoided in the process of data collection, especially
in studies of a longitudinal nature. Because samples with missing values lose the balanced
structure of their complete counterparts, special methods have to be developed for their
analysis. The method of analysis is closely related to the reason for the missingness, or the
missing data mechanism. Rubin (1976) and Little and Rubin (2002, pp. 11–19) formally
defined three missing data mechanisms. Missing completely at random (MCAR) is a process
in which missingness is independent of both the observed and the missing values; missing
at random (MAR) is a process in which missingness depends on the observed values, not
the values being missed; missing not at random (MNAR) is a process in which missingness
depends on the values being missed. The missing data with MCAR and MAR mechanisms
are also referred to as ignorable nonresponses, because maximum likelihood estimates (MLE)
preserve many of their properties if these mechanisms are ignored.
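As a rough illustration of the three mechanisms (in Python rather than the paper's SAS IML; the cutoffs, rates, and regression slope below are arbitrary choices for the sketch, not from the paper), one can generate a bivariate sample in which $x_1$ is always observed and $x_2$ is deleted under each mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x1 = rng.normal(size=N)                 # always observed
x2 = 0.5 * x1 + rng.normal(size=N)      # subject to missingness

# MCAR: missingness ignores both the observed and the missing values.
miss_mcar = rng.random(N) < 0.3
# MAR: missingness depends only on the always-observed x1.
miss_mar = x1 > 0.5
# MNAR: missingness depends on the value of x2 itself.
miss_mnar = x2 > 0.5

x2_mar = np.where(miss_mar, np.nan, x2)  # MAR-incomplete version of x2
```

Under MAR, the probability that $x_2$ is missing is fully determined by the observed $x_1$, which is what allows likelihood-based methods to ignore the mechanism.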
Structural equation modeling (SEM) is regularly used for analyzing survey data, where
incomplete or missing data occur regularly. Many procedures have been developed for SEM
with missing data. Two noted ones are the direct maximum likelihood (ML) and the two-
stage ML using the normal distribution assumption (e.g., Jamshidian & Bentler, 1999; Savalei
& Bentler, 2007). In the direct ML, the MLEs of the structural parameters are obtained by
directly maximizing the likelihood function. For the two-stage ML, MLEs of the means and
covariances of the saturated model are obtained in the first stage, parallel to the direct ML.
The MLEs of the saturated model are further fitted by the structural model in the second
stage, using the normal distribution based discrepancy function. When the population is not
normally distributed, previous literature suggests that all missing values need to be MCAR
in order for the normal distribution based MLE to be consistent (e.g., Laird, 1988; Rotnitzky
& Wypij, 1994; Yuan & Bentler, 2000). However, recent development indicates that normal
distribution based MLE is still consistent under the MAR mechanism and nonnormally
distributed population (Yuan, 2007). One of the purposes of the paper is to introduce this
result using the simple bivariate case and provide the program setup for applying it to
SEM using the two-stage ML. We will show how to get consistent standard errors (SEs) for
parameter estimates and valid statistics for overall model evaluation.
When missing values are MNAR, one generally has to model the missing data mechanism
in order to obtain consistent parameter estimates. However, missing data mechanisms de-
pend on whether variables related to missingness are observed and included in the analysis.
In practice, a researcher often has no difficulty in identifying a set of variables of interest
to test a theory using SEM. However, it is always difficult to verify the MAR mechanism. Schafer
and Olsen (1998), Collins, Schafer and Kam (2001), and Graham (2003) suggested that one
should include as many variables as possible to maximize the chance that the missing data mechanism is MAR.
One can use the two-stage ML to implement the suggestion. In the first stage, the normal
distribution based MLEs are obtained for the means and covariances of all the variables. In
the second stage, only estimates of means and covariances corresponding to the variables of
interest are selected and fitted by the structural model. The variables that participate in
the first-stage estimation but not in the second-stage analysis are called auxiliary variables.
A recent study by Savalei and Bentler (2007), with normally distributed data, showed that
the two-stage ML is nearly optimal when including the auxiliary variables. In this paper, we
will study the two-stage ML when a sample comes from an unknown population distribution
and missing values are MAR, aiming to develop a valid procedure for SEM with typically
nonnormally distributed samples in practice (Micceri, 1989). We will present the inference
problem for SEM with missing data in the same form as that for complete data from an unknown
population distribution. Once the two are unified, many procedures developed for complete data can be
equally applied to the two-stage ML with missing data.
Although there are many studies on missing data in the SEM literature, few facilitate a
good understanding of the problem and result, especially when the population is nonnormally
distributed and the missing data mechanism is MAR. For a sample with missing data from
an unknown population distribution, the results obtained by Rubin (1976) do not apply to
the direct or two-stage ML due to working with a wrong likelihood function. Arminger and
Sobel (1990) proposed to apply the so-called pseudo ML to SEM with missing data, but
they did not show why or when the pseudo ML, developed by White (1982) and Gourieroux,
Monfort and Trognon (1984) for complete data, can be applied to missing data. Actually,
examples exist in which the pseudo ML fails when data are nonnormally distributed and the
missing data mechanism is MAR; one will be given in section 3. Yuan (2007) provided the
conditions for normal distribution based ML to apply to a nonnormal population distribution
with MAR data, but the paper mainly dealt with saturated means and covariances, not SEM
in particular. The development is also beyond the level of most quantitative psychologists.
Actually, with missing data, even for a normally distributed population, little literature
in SEM exists that facilitates thorough understanding of issues related to proper statistical
inference; almost all are about how to obtain the MLEs and the likelihood ratio (LR) statistic.
To fill the gap, the paper will provide the technical details for inference with missing data
using a simple scenario. The prerequisite knowledge is basic calculus, linear algebra, and
a graduate course in statistics/probability. We aim to let quantitative students in social
sciences fully understand the components of missing data inference: consistency,
asymptotic normality, and the so-called sandwich-type covariance matrix. We expect that
the material will be used as a teaching tool in SEM courses for graduate students in the
social and behavioral sciences. After obtaining the results in the simplest case, we will relate
them to parallel results in higher dimensions. Such a connection will allow readers to better
understand the limitation and strength of missing data methodology. We hope that, with
a better understanding of the problem, researchers will be able to choose the most proper
methodology for SEM with missing data instead of merely relying on the default output of
a software package.
Even though the two-stage ML has a lot to offer for SEM with missing data from typically
unknown population distributions in practice, standard software does not facilitate
its application. For example, EQS and Mplus provide the MLEs and their SEs based on a
sandwich-type covariance matrix. But their outputs do not contain the whole sandwich-type
covariance matrix, which is a necessary component for the second-stage analysis. Another
contribution of this paper is to introduce a SAS IML program for the first-stage analysis.
For any missing data set with auxiliary variables, by modifying three lines of the provided
statements, the program will generate the necessary elements for mean and covariance structure
analysis in the second stage. With these elements, SEM with missing data is essentially the
same as SEM with complete data for any SEM programs that allow users to input the first
four moments for analysis.
In section 2 of the paper, we show the consistency and asymptotic normality of the MLE
using the simple bivariate case. Section 3 provides the components for proper two-stage
SEM using the normal distribution based estimates of means and covariances. Section 4
introduces SAS and EQS programs for implementing the two-stage ML. An example using
the test score data of Mardia, Kent and Bibby (1979) is used to illustrate the application of
these programs. Discussion and conclusion are provided in section 5.
2. Consistency, Asymptotic Normality and Sandwich-type Covariance Matrix in a Simple
Bivariate Case
Let $x = (x_1, x_2)'$ be a bivariate population with mean $\mu = (\mu_1, \mu_2)'$ and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$
where the distribution of $x$ is unknown. For simplicity only, we assume that $\Sigma$ is known.
Suppose a sample from $x$ is obtained as
$$x_{11}, \ldots, x_{n1}, x_{(n+1)1}, \ldots, x_{N1}$$
$$x_{12}, \ldots, x_{n2}, \qquad\qquad (1)$$
where the first variable is observed on all the cases while the second variable is observed only
on the first n cases. We will use this sample to show that, under very general conditions, the
normal distribution based MLE is consistent and asymptotically normally distributed, and
that the asymptotic covariance matrix of the MLE is consistently estimated by a sandwich-
type covariance matrix. In the following we will first obtain the normal distribution based
MLEs of µ1 and µ2. Then we will study the properties of the MLEs.
Let $x_i = (x_{i1}, x_{i2})'$ for $i = 1, 2, \ldots, n$; $\sigma_1 = \sigma_{11}^{1/2}$, $\sigma_2 = \sigma_{22}^{1/2}$, and $\rho = \sigma_{12}/(\sigma_1\sigma_2)$. After
omitting a constant, the normal distribution-based log likelihood function for the sample in
(1) is given by
$$l(\mu) = \sum_{i=1}^{N} l_i(\mu),$$
where
$$l_i(\mu) = -\frac{1}{2}(x_i - \mu)'\Sigma^{-1}(x_i - \mu) = -\frac{1}{2(1-\rho^2)}\left[\frac{(x_{i1}-\mu_1)^2}{\sigma_{11}} - \frac{2\rho(x_{i1}-\mu_1)(x_{i2}-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_{i2}-\mu_2)^2}{\sigma_{22}}\right],$$
$$i = 1, 2, \ldots, n; \qquad (2a)$$
$$l_i(\mu) = -\frac{1}{2\sigma_{11}}(x_{i1}-\mu_1)^2, \quad i = n+1, n+2, \ldots, N. \qquad (2b)$$
Notice that l(µ) is a quadratic function of µ1 and µ2. Setting the derivatives of l(µ) with
respect to µ1 and µ2 equal to zero leads to two linear equations for µ1 and µ2. Let
$$\bar{x}_1 = \frac{1}{N}\sum_{i=1}^{N} x_{i1}, \qquad \bar{x}_{1*} = \frac{1}{n}\sum_{i=1}^{n} x_{i1}, \qquad \bar{x}_{2*} = \frac{1}{n}\sum_{i=1}^{n} x_{i2}.$$
The MLEs
$$\hat\mu_1 = \bar{x}_1 \quad \text{and} \quad \hat\mu_2 = \bar{x}_{2*} + b(\bar{x}_1 - \bar{x}_{1*}) \qquad (3)$$
are obtained by directly solving these equations, where $b = \sigma_{12}/\sigma_{11}$ is the slope of $x_2$ regressed on $x_1$.
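The closed-form estimators in (3) are easy to compute directly. Below is a small Python sketch (not the paper's SAS IML program); the population values $\mu = (0, 1)'$, $\rho = 0.6$, and the cutoff $c = 0$ are arbitrary choices for the illustration:

```python
import numpy as np

def two_stage_mle(x1_all, x2_obs, sigma11, sigma12):
    """Normal-theory MLEs in (3): x1 is observed for all N cases,
    x2 only for the first n cases; Sigma is treated as known."""
    n = len(x2_obs)
    b = sigma12 / sigma11              # slope of x2 regressed on x1
    xbar1 = x1_all.mean()              # mean over all N cases
    xbar1_star = x1_all[:n].mean()     # mean over the n complete cases
    xbar2_star = x2_obs.mean()
    mu1_hat = xbar1
    mu2_hat = xbar2_star + b * (xbar1 - xbar1_star)
    return mu1_hat, mu2_hat

# Illustrative MAR data from model (4a)/(5): x2 observed only when z1 <= 0.
rng = np.random.default_rng(1)
N = 200_000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x1 = z1                                # mu1 = 0, sigma1 = 1
x2 = 1.0 + 0.6 * z1 + 0.8 * z2         # mu2 = 1, rho = 0.6
w = z1 <= 0.0                          # missingness indicator, c = 0
x1_all = np.concatenate([x1[w], x1[~w]])  # complete cases first
mu1_hat, mu2_hat = two_stage_mle(x1_all, x2[w], 1.0, 0.6)
```

The complete-case mean $\bar x_{2*}$ alone is badly biased under this selection (its expectation is about 0.52 rather than 1), while $\hat\mu_2$ corrects the bias through the regression term $b(\bar x_1 - \bar x_{1*})$.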
We will next show that $\hat\mu_2$ is consistent and $(\hat\mu_1, \hat\mu_2)$ are jointly asymptotically normally
distributed when the missing values in (1) are MAR. For the sample in (1), we need to have
a data model to introduce the MAR mechanism. Let z1 and z2 be independent random
variables, each having a mean 0 and variance 1.0. We may denote them as z1 ∼ F1(t) and
z2 ∼ F2(t) with F1 and F2 being two arbitrary cumulative distribution functions (CDF). A
widely used population model for bivariate random variables is (e.g., Hoffman, 1959; Lee &
Rodgers, 1998; Little & Rubin, 2002, p. 23)
$$x_1 = \mu_1 + \sigma_1 z_1, \qquad x_2 = \mu_2 + \sigma_2[\rho z_1 + (1-\rho^2)^{1/2} z_2]; \qquad (4a)$$
or
$$x = \mu + Lz, \qquad (4b)$$
where
$$L = \begin{pmatrix} \sigma_1 & 0 \\ \sigma_2\rho & \sigma_2(1-\rho^2)^{1/2} \end{pmatrix}$$
and $z = (z_1, z_2)'$. Let the sample in (1) be generated according to
$$w = \begin{cases} 1 & \text{if } z_1 \le c, \\ 0 & \text{if } z_1 > c, \end{cases} \qquad (5)$$
where w = 1 indicates that x2 is observed and w = 0 indicates that x2 is missing. Since
$z_1 > c$ is equivalent to $x_1 > \mu_1 + \sigma_1 c$, the missing values within the sample in (1) are MAR.
Corresponding to the observed and missing data are two random vectors $z_* = (z_{1*}, z_{2*})'$ and
$z^* = (z_1^*, z_2^*)'$; that is, $z_* \sim (z \mid w = 1)$ and $z^* \sim (z \mid w = 0)$. These conditional random vectors
will facilitate obtaining the consistency of the MLEs and the sandwich-type covariance matrix
using the law of large numbers, and the asymptotic normality of the MLEs using the central limit
theorem. Since $z_1$ and $z_2$ are independent, so are $w$ and $z_2$. Consequently, $z_{2*} = (z_2 \mid w = 1) = z_2 \sim F_2(t)$, $z_2^* = (z_2 \mid w = 0) = z_2 \sim F_2(t)$, and $z_{2*}$ and $z_2^*$ are independent of $z_{1*}$ and $z_1^*$.
Let $(z_i, w_i)$, $i = 1, 2, \ldots, N$, be a random sample from $(z, w)$ through which the sample
in (1) is created. Corresponding to the sample $(z_i, w_i)$ are independent random vectors $z_{i*} = (z_{i1*}, z_{i2*})'$ from the population $z_*$ and $z_i^* = (z_{i1}^*, z_{i2}^*)'$ from the population $z^*$. Obviously,
$n = \sum_{i=1}^{N} w_i$ is random and follows the binomial distribution $B(N, F_1(c))$. Not so obvious
is that, for a continuous function $g(t_1, t_2)$, $\sum_{i=1}^{n} g(z_{i1}, z_{i2})$ and $\sum_{i=1}^{N} w_i g(z_{i1*}, z_{i2*})$ have the
same distribution; $\sum_{i=n+1}^{N} g(z_{i1}, z_{i2})$ and $\sum_{i=1}^{N} (1-w_i) g(z_{i1}^*, z_{i2}^*)$ have the same distribution.
These relationships will also be used to study the consistency and asymptotic normality of
$\hat\mu_2$. A proof is provided in appendix A for readers who are interested in the details behind
these equalities in distribution.
2.1 Consistency
Obviously, $\hat\mu_1 = \bar x_1$ is consistent according to the law of large numbers. So we only need
to show the consistency of $\hat\mu_2$. We can write the $\hat\mu_2$ in (3) as
$$\hat\mu_2 = b\bar x_1 + (\bar x_{2*} - b\bar x_{1*}). \qquad (6)$$
Because $\bar x_1$ is consistent for $\mu_1$, we only need to find the probability limit of the second
term on the right side of (6). Using (4a), we have
$$\begin{aligned}
\bar x_{2*} - b\bar x_{1*} &= (\mu_2 - b\mu_1) + \frac{1}{n}\sum_{i=1}^{n}\bigl\{\sigma_2[\rho z_{i1} + (1-\rho^2)^{1/2} z_{i2}] - b\sigma_1 z_{i1}\bigr\} \\
&= (\mu_2 - b\mu_1) + \frac{\sigma_2(1-\rho^2)^{1/2}}{n}\sum_{i=1}^{n} z_{i2} \\
&= (\mu_2 - b\mu_1) + \frac{\sigma_2(1-\rho^2)^{1/2}}{\sum_{i=1}^{N} w_i/N}\cdot\frac{1}{N}\sum_{i=1}^{N} w_i z_{i2*},
\end{aligned} \qquad (7)$$
where the second equality is due to $\sigma_2\rho = b\sigma_1$. Notice that $w_i$ and $z_{i2*}$ are independent
with $E(w_i) = F_1(c)$ and $E(z_{i2*}) = 0$. According to the law of large numbers,
$$\frac{1}{N}\sum_{i=1}^{N} w_i \xrightarrow{P} F_1(c) \quad\text{and}\quad \frac{1}{N}\sum_{i=1}^{N} w_i z_{i2*} \xrightarrow{P} 0,$$
where $\xrightarrow{P}$ is the notation for convergence in probability. The consistency of $\hat\mu_2$ follows from
(6), (7) and $\bar x_1 \xrightarrow{P} \mu_1$.
2.2 Asymptotic normality
We will use the central limit theorem to show that $\hat\mu_1$ and $\hat\mu_2$ are jointly asymptotically
normally distributed. Let $\hat\mu = (\hat\mu_1, \hat\mu_2)'$; we will need to relate $\hat\mu - \mu$ to $z_{i1}$, $w_i$ and $z_{i2*}$ in order
to apply the central limit theorem for independent and identically distributed observations.
It follows from (4a) that
$$\bar x_1 = \mu_1 + \frac{\sigma_1}{N}\sum_{i=1}^{N} z_{i1}. \qquad (8)$$
Combining (3), (7) and (8) yields
$$\hat\mu - \mu = H_N \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} z_{i1} \\ w_i z_{i2*} \end{pmatrix}, \qquad (9)$$
where
$$H_N = \begin{pmatrix} \sigma_1 & 0 \\ b\sigma_1 & \left(\sum_{i=1}^{N} w_i/N\right)^{-1}\sigma_2(1-\rho^2)^{1/2} \end{pmatrix}.$$
Notice that the vectors $(z_{i1}, w_i z_{i2*})'$, $i = 1, 2, \ldots, N$, are independent and identically
distributed with mean $(0, 0)'$. It follows from the central limit theorem that
$$\frac{1}{\sqrt N}\sum_{i=1}^{N} \begin{pmatrix} z_{i1} \\ w_i z_{i2*} \end{pmatrix}$$
converges to a bivariate normal distribution with mean $(0, 0)'$ and variance-covariance matrix
$$\Pi = E\begin{pmatrix} z_1^2 & z_1 w z_{2*} \\ w z_{2*} z_1 & w^2 z_{2*}^2 \end{pmatrix}.$$
Recall that $w$ is either 0 or 1 with $E(w) = E(w^2) = F_1(c)$, and $w$ is independent of
$z_{2*} \sim F_2(t)$. Thus,
$$\Pi = \begin{pmatrix} 1 & 0 \\ 0 & F_1(c) \end{pmatrix}.$$
It is easy to see that
$$H_N \xrightarrow{P} H = \begin{pmatrix} \sigma_1 & 0 \\ b\sigma_1 & \sigma_2(1-\rho^2)^{1/2}/F_1(c) \end{pmatrix}.$$
It follows from (9) and the well-known Slutsky theorem$^1$ (e.g., Ferguson, 1996, pp. 39–42)
that
$$\sqrt N(\hat\mu - \mu) \xrightarrow{L} N(0, \Omega), \qquad (10a)$$
where $\xrightarrow{L}$ denotes convergence in distribution and
$$\Omega = H\Pi H' = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22}\{1 + [1/F_1(c) - 1](1-\rho^2)\} \end{pmatrix}. \qquad (10b)$$
The asymptotic normality of the MLE has been established.
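A quick Monte Carlo sketch (in Python; the values $\rho = 0.6$, $c = 0$, and a standardized exponential for $z_2$ are arbitrary choices, not from the paper) can check the $(2,2)$ element of $\Omega$ in (10b). With $F_1(c) = 0.5$ and $\sigma_{22} = 1$, the theory predicts $\mathrm{Var}[\sqrt N(\hat\mu_2 - \mu_2)] \to 1 + [1/0.5 - 1](1 - 0.6^2) = 1.64$ even though $z_2$ is nonnormal:

```python
import numpy as np

# Monte Carlo check of (10b): rho = 0.6, c = 0 gives F1(c) = 0.5 and
# Omega_22 = sigma22 * {1 + [1/F1(c) - 1](1 - rho^2)} = 1.64 for sigma22 = 1.
rng = np.random.default_rng(2)
R, N, rho, c = 2000, 500, 0.6, 0.0
stats = []
for _ in range(R):
    z1 = rng.normal(size=N)
    z2 = rng.exponential(size=N) - 1.0        # mean 0, variance 1, skewed
    x1 = z1                                   # mu1 = 0, sigma1 = 1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2  # mu2 = 0, sigma2 = 1
    w = z1 <= c                               # MAR selection as in (5)
    mu2_hat = x2[w].mean() + rho * (x1.mean() - x1[w].mean())  # (3), b = rho
    stats.append(np.sqrt(N) * mu2_hat)
emp_var = np.var(stats)
theory_var = 1 + (1 / 0.5 - 1) * (1 - rho**2)  # = 1.64 from (10b)
```

The empirical variance of $\sqrt N \hat\mu_2$ should land close to 1.64, illustrating that (10b) does not require $F_2$ to be normal.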
2.3 Sandwich-type covariance matrix
The Ω in (10) can be estimated by the so-called sandwich-type covariance matrix. Ac-
tually, only in the simplest cases can we estimate Ω directly, as in (10b) when replacing
F1(c) by n/N . The sandwich-type covariance matrix allows us to easily obtain a consistent
estimator for Ω in the general case.
The sandwich-type covariance matrix involves the first and second derivatives of the log
likelihood function with respect to the unknown parameters. For the $l_i(\mu)$ in (2), these are
given by the $2 \times 1$ vector $\dot l_i(\mu) = \partial l_i(\mu)/\partial\mu$ and the $2 \times 2$ matrix $\ddot l_i(\mu) = \partial^2 l_i(\mu)/\partial\mu\partial\mu'$.
Let
$$A_N = \frac{1}{N}\sum_{i=1}^{N} \ddot l_i(\mu) \quad\text{and}\quad B_N = \frac{1}{N}\sum_{i=1}^{N} \dot l_i(\mu)\dot l_i'(\mu). \qquad (11a)$$
The sandwich-type covariance matrix is given by
$$\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}, \qquad (11b)$$
where $\hat A_N$ and $\hat B_N$ are obtained when the $\mu$ in (11a) is replaced by $\hat\mu$.
Because the proof of the consistency of $\hat\Omega_{SWN}$ involves a lot of notation and algebraic
operations, we put it in appendix B, where essentially the same technique is used as for
showing the consistency of $\hat\mu$.
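The construction in (11) is mechanical once the scores and Hessians of (2a) and (2b) are written out. Below is a Python sketch for the bivariate example (population values $\rho = 0.6$, $\mu = (0,1)'$, $c = 0$ in the test are arbitrary choices for the illustration):

```python
import numpy as np

def sandwich_bivariate(x1_all, x2_obs, mu, Sigma):
    """Sandwich estimator (11) for the bivariate example with Sigma known:
    A_N averages the Hessians of (2a)/(2b), B_N the outer products of the
    scores, both evaluated at the MLE mu."""
    n, N = len(x2_obs), len(x1_all)
    Sinv = np.linalg.inv(Sigma)
    A = np.zeros((2, 2))
    B = np.zeros((2, 2))
    for i in range(N):
        if i < n:    # complete case, from (2a)
            r = np.array([x1_all[i] - mu[0], x2_obs[i] - mu[1]])
            score = Sinv @ r
            hess = -Sinv
        else:        # x2 missing, from (2b)
            score = np.array([(x1_all[i] - mu[0]) / Sigma[0, 0], 0.0])
            hess = np.array([[-1.0 / Sigma[0, 0], 0.0], [0.0, 0.0]])
        A += hess
        B += np.outer(score, score)
    A /= N
    B /= N
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv   # estimates Omega in (10a)
```

For MAR data generated from (4a) and (5), the output should approach the $\Omega$ in (10b), e.g., $\hat\Omega_{22} \approx \sigma_{22}\{1 + [1/F_1(c) - 1](1-\rho^2)\}$.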
We have assumed $\Sigma$ known throughout this section for simplicity. Parallel results also
hold when $\Sigma$ is estimated. In particular, for the observed sample in (1), analytical formulas
for the MLEs $\hat\mu_1$, $\hat\mu_2$, $\hat\sigma_{11}$, $\hat\sigma_{12}$, and $\hat\sigma_{22}$ are provided in Anderson (1957). Under the population
model (4) and (5), and the condition that $z_1 \sim F_1(t)$ and $z_2 \sim F_2(t)$ have finite fourth-order
moments, the MLEs are consistent and asymptotically normally distributed. The asymptotic
covariance matrix is consistently estimated by a sandwich-type covariance matrix parallel to
(11).

$^1$The theorem states that, if $x_n \xrightarrow{L} x$, $a_n$ converges in probability to $a$ and $b_n$ converges in probability to $b$, then $a_n x_n + b_n \xrightarrow{L} ax + b$.
3. Missing Data in the General Case and Inference in SEM
Let $x$ be a $p$-variate population with $E(x) = \mu$ and $\mathrm{Cov}(x) = \Sigma$. Let $z_1, z_2, \ldots, z_p$ be
independent random variables with mean 0, variance 1, and finite fourth-order moments;
L be a lower triangular matrix such that LL′ = Σ. Parallel to (4), a population model for
x is given by
$$x = \mu + Lz, \qquad (12)$$
where z = (z1, z2, . . . , zp)′. Because the distributions of z1 to zp are arbitrary other than
subject to finite fourth-order moments, (12) includes an infinite number of distributions. A
normally distributed x corresponds to z ∼ Np(0, I). Let x1, x2, . . ., xN denote the observed
vectors for a sample from x with missing values. Corresponding to each xi is a vector zi =
$(z_{i1}, z_{i2}, \ldots, z_{ip})'$. Suppose $x_{ij_1}, x_{ij_2}, \ldots, x_{ij_k}$ are missing when each of $z_{il_1}, z_{il_2}, \ldots, z_{il_m}$ falls
into certain intervals. Such selected observations are widely used in the econometrics
literature to model human behavior (e.g., Amemiya, 1973; Heckman, 1979; Tobin, 1958),
where the distribution of $x$ is known and the missing data mechanism is modeled. Here,
we do not know the distribution of $x$; because we consider the MAR mechanism, we will
not need to explicitly model it either. Let $l = \max(l_1, l_2, \ldots, l_m)$ and $L_l$ be the upper-left
$l \times l$ submatrix of $L$; then $(x_{i1}, x_{i2}, \ldots, x_{il})' = L_l(z_{i1}, z_{i2}, \ldots, z_{il})'$. Although $z$ is a
latent vector, all the information regarding $(z_{i1}, z_{i2}, \ldots, z_{il})'$ is contained in $(x_{i1}, x_{i2}, \ldots, x_{il})'$.
When $l < \min(j_1, j_2, \ldots, j_k)$ and $(x_{i1}, x_{i2}, \ldots, x_{il})'$ is observed, all the information related to
the missing values is observed, and thus the missing data mechanism is MAR (Rubin, 1976). When
$l \ge \min(j_1, j_2, \ldots, j_k)$ or any variables in $(x_{i1}, x_{i2}, \ldots, x_{il})'$ are missing, the probability of
missingness is related to the values of the variables being missed and the missing data
mechanism is MNAR.
Under (12) and the MAR mechanism described above, Yuan (2007) showed that the
normal distribution-based MLEs of $\mu$ and $\Sigma$ are still consistent and asymptotically normally
distributed. Because $\Sigma$ is symmetric, we let $\sigma = \mathrm{vech}(\Sigma)$ be the vector obtained by stacking the
columns of the lower-triangular portion of $\Sigma$. Let $\hat\beta = (\hat\mu', \hat\sigma')'$ be the MLE of $\beta = (\mu', \sigma')'$.
Parallel to (10) and (11), the asymptotic distribution of $\hat\beta$ is characterized by
$$\sqrt N(\hat\beta - \beta) \xrightarrow{L} N(0, \Omega_{SW}), \qquad (13a)$$
where $\Omega_{SW}$ can be consistently estimated by
$$\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}. \qquad (13b)$$
Specifically, after omitting a constant, the log likelihood function based on $x_i \sim N_{p_i}(\mu_i, \Sigma_i)$
is
$$l(\mu, \Sigma) = \sum_{i=1}^{N} l_i(\mu, \Sigma),$$
where $p_i$ is the number of observed variables in $x_i$ and
$$l_i(\mu, \Sigma) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x_i - \mu_i)'\Sigma_i^{-1}(x_i - \mu_i), \quad i = 1, 2, \ldots, N.$$
Each $\mu_i$ is a subvector of $\mu$ and each $\Sigma_i$ is a submatrix of $\Sigma$, corresponding to the observed
variables in $x_i$. Let $\mathrm{vec}(\Sigma_i)$ be the $p_i^2 \times 1$ vector obtained by stacking the columns of $\Sigma_i$,
$D_{p_i} = \partial\mathrm{vec}(\Sigma_i)/\partial\sigma_i'$, $\dot\sigma_i = \partial\sigma_i/\partial\sigma'$, $\dot\mu_i = \partial\mu_i/\partial\mu'$, and $\otimes$ be the notation for the so-called Kronecker
product (see e.g., Bollen, 1989, p. 465). Let
$$W_i = \frac{1}{2} D_{p_i}'(\Sigma_i^{-1} \otimes \Sigma_i^{-1}) D_{p_i},$$
and
$$\dot l_{i\mu}(\mu, \Sigma) = \partial l_i(\mu, \Sigma)/\partial\mu, \qquad \dot l_{i\sigma}(\mu, \Sigma) = \partial l_i(\mu, \Sigma)/\partial\sigma,$$
$$\ddot l_{i\mu\mu}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\mu\partial\mu', \qquad \ddot l_{i\mu\sigma}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\mu\partial\sigma',$$
$$\ddot l_{i\sigma\mu}(\mu, \Sigma) = \ddot l_{i\mu\sigma}'(\mu, \Sigma), \qquad \ddot l_{i\sigma\sigma}(\mu, \Sigma) = \partial^2 l_i(\mu, \Sigma)/\partial\sigma\partial\sigma'.$$
Then
$$\dot l_{i\mu}(\mu, \Sigma) = \dot\mu_i'\Sigma_i^{-1}(x_i - \mu_i), \qquad \dot l_{i\sigma}(\mu, \Sigma) = \dot\sigma_i' W_i \mathrm{vech}[(x_i - \mu_i)(x_i - \mu_i)' - \Sigma_i], \qquad (14)$$
$$\ddot l_{i\mu\mu}(\mu, \Sigma) = -\dot\mu_i'\Sigma_i^{-1}\dot\mu_i, \qquad \ddot l_{i\mu\sigma}(\mu, \Sigma) = -\dot\mu_i'\{\Sigma_i^{-1} \otimes [(x_i - \mu_i)'\Sigma_i^{-1}]\} D_{p_i}\dot\sigma_i, \qquad (15a)$$
$$\ddot l_{i\sigma\sigma}(\mu, \Sigma) = -\dot\sigma_i' D_{p_i}'\Bigl\{\Bigl[\Sigma_i^{-1}(x_i - \mu_i)(x_i - \mu_i)'\Sigma_i^{-1} - \frac{1}{2}\Sigma_i^{-1}\Bigr] \otimes \Sigma_i^{-1}\Bigr\} D_{p_i}\dot\sigma_i. \qquad (15b)$$
The matrices $\hat A_N$ and $\hat B_N$ in (13) are evaluated as
$$\hat A_N = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} \ddot l_{i\mu\mu}(\hat\mu, \hat\Sigma) & \ddot l_{i\mu\sigma}(\hat\mu, \hat\Sigma) \\ \ddot l_{i\mu\sigma}'(\hat\mu, \hat\Sigma) & \ddot l_{i\sigma\sigma}(\hat\mu, \hat\Sigma) \end{pmatrix}$$
and
$$\hat B_N = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} \dot l_{i\mu}(\hat\mu, \hat\Sigma)\dot l_{i\mu}'(\hat\mu, \hat\Sigma) & \dot l_{i\mu}(\hat\mu, \hat\Sigma)\dot l_{i\sigma}'(\hat\mu, \hat\Sigma) \\ \dot l_{i\sigma}(\hat\mu, \hat\Sigma)\dot l_{i\mu}'(\hat\mu, \hat\Sigma) & \dot l_{i\sigma}(\hat\mu, \hat\Sigma)\dot l_{i\sigma}'(\hat\mu, \hat\Sigma) \end{pmatrix}.$$
A sandwich-type covariance matrix for the structured parameters, parallel to (13), is given in
Arminger and Sobel (1990), but they did not provide conditions for the result in (13) to
hold.
The condition stated in this section allows the MAR mechanism to depend on all the
linear combinations of the previously observed variables. For such a purpose, we specified
$L$ as a lower triangular matrix so that $(z_{i1}, z_{i2}, \ldots, z_{il})'$ and $(x_{i1}, x_{i2}, \ldots, x_{il})'$ determine
each other. In practice, a participant may join the study after missing a few occasions and then be
missing again. The missingness at a later stage may depend on all the previously observed
variables. We can match such a case with (12) by specifying an $L$ whose rows corresponding
to the observed variables form the upper-left part of a lower triangular matrix. Then the
result in (13) still holds.
Although, for any $\mu$ and $\Sigma$, (12) contains infinitely many nonnormal distributions for
which (13) holds under the MAR mechanism described, the result in (13) does not hold
for all nonnormal distributions. Yuan (2007) provided the following example with $p = 2$,
$\mu_1 = \mu_2 = 0$,
$$L = \begin{pmatrix} 1 & 0 & 0 \\ 0 & l_{22} & l_{23} \end{pmatrix}$$
and $z = (z_1, z_1^2, z_2)'$, where $z_1$ and $z_2$ are independent and each follows $N(0, 1)$. Suppose
$x_{i2}$ is missing when $x_{i1} = z_{i1}$ is too large; the missing data mechanism is still MAR. But
the MLEs of $\mu_2$, $\sigma_{12}$ and $\sigma_{22}$ are no longer consistent. The problem is due to the nonlinear
relationship between $x_1$ and $x_2$. In such cases, including more auxiliary variables in the first stage
of the two-stage ML may mitigate the bias.
When all missing values are MCAR, the result in (13) holds for any nonnormally dis-
tributed population (Yuan & Bentler, 2000). There is no need for the data model (12) to be
part of the condition.
The $\hat\beta$ in (13) contains both $\hat\mu$ and $\hat\sigma = \mathrm{vech}(\hat\Sigma)$ corresponding to all the variables in
$x$. When auxiliary variables exist in $x$, one may pick the subset of $\hat\beta$ corresponding to the
variables that one is interested in using for further analysis. For any analysis related to mean
comparisons, one can use
$$\sqrt N(\hat\mu_s - \mu_s) \xrightarrow{L} N(0, \Omega_{SWN\mu_s})$$
for related statistical inference, where $\hat\mu_s$ is the selected subset of $\hat\mu$ and $\hat\Omega_{SWN\mu_s}$ is the
selected submatrix of $\hat\Omega_{SWN}$. For example, one can use
$$T_\mu = N(\hat\mu_s - \mu_{s0})'\hat\Omega_{SWN\mu_s}^{-1}(\hat\mu_s - \mu_{s0}) \xrightarrow{L} \chi^2_{p_s}$$
to test the hypothesis $\mu_s = \mu_{s0}$, where $p_s$ is the number of parameters in $\mu_s$. Similarly, for
any analysis related to modeling covariance matrices, one can use
$$\sqrt N(\hat\sigma_s - \sigma_s) \xrightarrow{L} N(0, \Omega_{SWN\sigma_s}) \qquad (16)$$
to derive the properties of parameter estimates or test statistics for overall model evaluation.
When performing a simultaneous analysis of mean and covariance structures based on
selected variables, one can use
$$\sqrt N(\hat\beta_s - \beta_s) \xrightarrow{L} N(0, \Omega_{SWNs}) \qquad (17)$$
to obtain the properties of parameter estimates and test statistics for the overall model.
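Once $\hat\mu_s$ and $\hat\Omega_{SWN\mu_s}$ are extracted, the Wald statistic $T_\mu$ is a one-line computation. A small Python sketch (the numerical values in the usage line are arbitrary, for illustration only):

```python
import numpy as np

def wald_test_means(mu_s_hat, mu_s0, omega_mu_s, N):
    """Wald statistic T_mu = N (mu_s_hat - mu_s0)' Omega^{-1} (mu_s_hat - mu_s0),
    to be referred to a chi-square with len(mu_s_hat) degrees of freedom."""
    d = np.asarray(mu_s_hat, dtype=float) - np.asarray(mu_s0, dtype=float)
    return float(N * d @ np.linalg.solve(np.asarray(omega_mu_s, dtype=float), d))

# Hypothetical numbers: two selected means, identity Omega, N = 100.
T = wald_test_means([0.1, -0.05], [0.0, 0.0], np.eye(2), 100)  # T = 1.25
```

Here $T = 1.25$ falls well below the 5% critical value of $\chi^2_2$ (about 5.99), so the hypothesis $\mu_s = \mu_{s0}$ would not be rejected in this toy case.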
In the context of SEM with complete data, the rescaled statistic developed in Satorra and
Bentler (1994), the asymptotically distribution free (ADF) statistics developed in Browne
(1984) and Yuan and Bentler (1997, 1998) are all based on results parallel to (16) and
(17). In the context of exploratory factor analysis or multiple-group analysis with missing
data, nonnormal data or data with outliers, standard errors for parameter estimates and
test statistics for overall model evaluation are also based on results parallel to (16) and (17)
(Yuan & Bentler, 2001; Yuan, Marshall & Bentler, 2002). For samples from nonnormal
population distributions with missing data being MCAR, statistics for overall model evaluation
and consistent SEs have been developed in Yuan and Bentler (2000). These can be
equally developed using (17), which only requires the missing data to be MAR. Once (17) holds and
a consistent $\hat\Omega_{SWNs}$ is available, the missing data, the missing data mechanism, and the distribution
of the population are no longer relevant. Actually, most procedures for mean and covariance
structure analysis with complete data are solely based on results parallel to (17).
When there are no missing data in the sample, the MLEs of $\mu$ and $\Sigma$ are given by
the sample mean vector $\bar x$ and sample covariance matrix $S = \sum_{i=1}^{N}(x_i - \bar x)(x_i - \bar x)'/N$, or
$\hat\beta = (\bar x', \mathrm{vech}'(S))'$. To better understand the results in (13) and (17), we also obtain the
counterpart of $\hat\Omega_{SWN}$ in (13) for complete data here. For complete data, $\dot\mu_i = I_p$,
$\Sigma_i = \Sigma$, $\dot\sigma_i = I_{p^*}$, and
$$W_i = W = \frac{1}{2}D_p'(\Sigma^{-1} \otimes \Sigma^{-1})D_p,$$
where $I_{p^*}$ is the identity matrix of order $p^* = p(p+1)/2$. Let $\hat W = D_p'(S^{-1} \otimes S^{-1})D_p/2$,
$z_i = \mathrm{vech}[(x_i - \bar x)(x_i - \bar x)']$, and the sample mean vector and covariance matrix of the $z_i$ be $\bar z$
and $S_{zz}$. Then $s = \mathrm{vech}(S) = \bar z$, and the sample covariance matrix of $x_i$ with $z_i$ is
$$S_{xz} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar x)(z_i - s)'.$$
It follows from (14) and (15) that
$$\hat A_N = \begin{pmatrix} S^{-1} & 0 \\ 0 & \hat W \end{pmatrix} \quad\text{and}\quad \hat B_N = \begin{pmatrix} S^{-1} & S^{-1}S_{xz}\hat W \\ \hat W S_{xz}'S^{-1} & \hat W S_{zz}\hat W \end{pmatrix}.$$
Thus,
$$\hat\Omega_{SWN} = \begin{pmatrix} S & S_{xz} \\ S_{xz}' & S_{zz} \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N}(u_i - \bar u)(u_i - \bar u)',$$
where
$$u_i = \begin{pmatrix} x_i \\ z_i \end{pmatrix} \quad\text{and}\quad \bar u = \begin{pmatrix} \bar x \\ \bar z \end{pmatrix}.$$
So, for complete data, $\hat\Omega_{SWN}$ is just the sample covariance matrix $S_{uu}$ of the $u_i$. The matrix $S_{uu}$
is a necessary component for ADF procedures with complete data (Browne, 1984; Bentler &
Yuan, 1999; Yuan & Bentler, 1998); it is also used to obtain consistent SEs and a rescaled
statistic (Satorra & Bentler, 1994). Parallel statistics and standard errors are obtained for
missing data when $S_{uu}$ is replaced by $\hat\Omega_{SWNs}$. Thus, we will refer readers to the technical
development for complete data and illustrate the applications with missing data through an
example in the next section.
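The complete-data identity $\hat\Omega_{SWN} = S_{uu}$ is easy to verify numerically. A Python sketch (an illustration, not the paper's program) that builds $u_i = (x_i', z_i')'$ with $z_i = \mathrm{vech}[(x_i - \bar x)(x_i - \bar x)']$:

```python
import numpy as np

def omega_complete(X):
    """For complete data, Omega_SWN is the sample covariance matrix S_uu of
    u_i = (x_i', z_i')' with z_i = vech[(x_i - xbar)(x_i - xbar)']."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    R = X - xbar
    # vech: stack the columns of the lower-triangular portion
    idx = [(i, j) for j in range(p) for i in range(j, p)]
    Z = np.array([[r[i] * r[j] for i, j in idx] for r in R])
    U = np.hstack([X, Z])
    Uc = U - U.mean(axis=0)
    return Uc.T @ Uc / N   # S_uu, of order p + p(p+1)/2
```

The upper-left $p \times p$ block of the result is the (biased) sample covariance matrix $S$, consistent with the block structure displayed above.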
4. A SAS IML Program and the EQS Code
We will introduce a SAS IML program that generates $\hat\mu$ and $\hat\Sigma$ using the EM
algorithm (Dempster, Laird & Rubin, 1977), and $\hat\Omega_{SWN}$ using (13) to (15). We will also
provide EQS code for SEM using the results (16) or (17). These programs will be introduced
through the following example.
Example. Table 1.2.1 of Mardia et al. (1979) contains test scores of N = 88 students on
five subjects. The five subjects are: Mechanics, Vectors, Algebra, Analysis, and Statistics.
The first two subjects were tested with closed book exams and the last three were tested
with open book exams. Tanaka, Watadani and Moon (1991) proposed to fit the data set
by a two-factor model with the first factor representing the latent score on “closed book”
and the second factor representing the latent score on “open book”. Letting y be the vector of
Mechanics, Vectors, Analysis, and Statistics, we fit the four variables by the following factor
model
$$y = \Lambda f + e$$
with mean and covariance structures
$$m(\theta) = \Lambda\tau \quad\text{and}\quad C(\theta) = \Lambda\Phi\Lambda' + \Psi, \qquad (18)$$
where $\tau = E(f) = (\tau_1, \tau_2)'$,
$$\Lambda = \begin{pmatrix} 1.0 & \lambda_{21} & 0 & 0 \\ 0 & 0 & 1.0 & \lambda_{42} \end{pmatrix}',$$
$\Phi$ is a covariance matrix with $\phi_{11} = \mathrm{Var}(f_1)$, $\phi_{12} = \phi_{21} = \mathrm{Cov}(f_1, f_2)$, $\phi_{22} = \mathrm{Var}(f_2)$, and
$\Psi = \mathrm{diag}(\psi_{11}, \psi_{22}, \psi_{33}, \psi_{44})$. There are $q = 11$ parameters in the model, with
$$\theta = (\tau_1, \tau_2, \lambda_{21}, \lambda_{42}, \phi_{11}, \phi_{21}, \phi_{22}, \psi_{11}, \psi_{22}, \psi_{33}, \psi_{44})'.$$
With $p_s = 4$ variables, the model degrees of freedom are $p_s^* + p_s - q = 3$. The normal
distribution-based LR statistic is $T_{ML} = 3.259$, with an associated p-value of 0.353 when
referred to $\chi^2_3$, suggesting that the model in (18) fits the data very well.
We use the variable Algebra$^2$ to create missing data schemes; thus $x_3$ is an auxiliary
variable. When $x_2$ = Vectors and $x_5$ = Statistics, corresponding to the smallest 31
scores of $x_3$ = Algebra, are removed and the variable Algebra is excluded from the
analysis, the missing data mechanism is MNAR. The missing data mechanism is MAR when
the five variables are considered simultaneously. The created data set can be found at
www.nd.edu/∼kyuan/missingdata/Mardiamv25.dat, with −99 for missing values.
Appendix C contains parts of a SAS IML program that performs the first-stage analysis
of the two-stage ML. The whole program can be found at www.nd.edu/∼kyuan/missingdata
/twosML.sas. The program calculates the MLEs of $\mu$ and $\Sigma$ using all the variables, 5 in
the example. It also calculates the sandwich-type covariance matrix $\hat\Omega_{SWN}$ of $\sqrt N\hat\beta$. When
the interest is in a subset of the variables to perform the second-stage analysis, 4 variables
in the example, one only needs to specify a subset of the subscripts corresponding to these
variables in order for the program to print the corresponding subvector and submatrices of
µ, Σ, and ΩSWN . The first five statements in appendix C are for reading a raw data set.
The statement
filename data ’d:\missingdata\mardiaMV25.dat’;
tells the program where the data set is located. The numbers in the data set need to be
separated by spaces and saved in ASCII or txt format. Each missing value is coded$^3$ as
-99. The statement
input v1 v2 v3 v4 v5;
tells the program that there are five variables in the data set. These two statements need to
be modified according to the location and number of variables in a particular application.
Starting from row 6 of appendix C is the main program$^4$ of the first stage of the two-stage
ML. The statement
V_forana=1, 2, 4, 5;
$^2$The variable Algebra is chosen because $T_{ML}$ is highly significant when it is included in model (18).
$^3$The value can be easily changed in the program if -99 is a possible value for real data.
$^4$The main program is at the end of the file twosML.sas on the web.
tells the SAS program that variables 1, 2, 4, and 5 are the variables for analysis in the
second stage. Variable x3 is an auxiliary variable that is included in the estimation. But
the estimates of its mean, variance, covariances and related elements in ΩSWN are excluded
from the output. We have used vech(Σ) to stack the columns of the lower-triangular portion
of Σ. Instead of stacking the columns, some programs may stack the rows of the lower-triangular portion of Σ in the computation. For example, EQS uses the vector σEQS = (σ11, σ21, σ22, σ31, σ32, σ33, . . . , σpp)′ for covariance structure analysis and βEQS = (σ′EQS, µ′)′ for mean and covariance structure analysis; ΩEQS also needs an extra row and column of zeros in order to generate correct SEs and statistics. A permutation of the rows and columns of ΩSWNσs or ΩSWNs is needed for a proper implementation of the second-stage analysis in programs that stack the rows of the lower-triangular portion of Σ.
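To make the reordering concrete, the following sketch (an illustrative Python translation, not part of the paper's SAS program) builds the index permutation that maps the column-stacked vech(Σ) ordering onto the row-stacked ordering used by EQS.

```python
def vech_col_order(p):
    # (row, col) pairs when stacking the columns of the lower triangle:
    # s11, s21, ..., sp1, s22, s32, ...
    return [(i, j) for j in range(p) for i in range(j, p)]

def vech_row_order(p):
    # (row, col) pairs when stacking the rows of the lower triangle (EQS):
    # s11, s21, s22, s31, s32, s33, ...
    return [(i, j) for i in range(p) for j in range(i + 1)]

def permutation(p):
    col, row = vech_col_order(p), vech_row_order(p)
    # entry k gives the position in the column-stacked vector that
    # supplies element k of the row-stacked vector
    return [col.index(pair) for pair in row]

p = 3
v_col = ["s11", "s21", "s31", "s22", "s32", "s33"]  # vech(Sigma), columns stacked
v_row = [v_col[k] for k in permutation(p)]
print(v_row)
```

Applying the same index vector to both the rows and the columns of the weight matrix carries out the permutation needed for ΩSWNσs and ΩSWNs.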
The two statements
homega_swc=permuc*homega_swc*permuc‘;
homega_sw=(permu*homega_sw*permu‘||j(pvs,1,0))//(j(1,pvs,0)||1);
are to perform the needed operations on ΩSWNσs and ΩSWNs
for programs like EQS. For a
SEM program that uses vech(Σ) or (µ′, vech′(Σ))′ for analysis, these two statements are not
needed and should be removed. In summary, if using EQS for the second-stage analysis, one
only needs to modify three statements in this SAS program for any specific application. One
may need to modify five statements if another program is used.
Applying this program to the created missing data generates
#total observed patterns= 2
cases--#observed V--observed V--missing V=
57 5 1 2 3 4 5
31 3 1 3 4 2 5
hat\mu_s= 38.954545 50.58184 46.681818 40.912445
hat\Sigma_s=
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
Following this in the output is the p∗s × p∗s matrix ΩSWNσs. The next and last matrix in the
output is the (p∗s + ps + 1) × (p∗s + ps + 1) matrix ΩEQS. Because these two matrices are
relatively large, we omit them here to save space. The whole output with name TwosML.lst
can be found at the same web address as the SAS program.
The first number in the above output, 2, is the number of total observed patterns in the
original sample, call it nop. The second part contains nop rows. Each row contains p + 2
numbers regarding the missing data information for a particular pattern. The first number
is the number of cases in the pattern; the second is the number of observed variables in the
pattern, call it pi. The next pi numbers are the set of indices for the observed variables. The
last p− pi numbers are the set of indices for the missing variables.
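The pattern summary printed by the SAS program can be mimicked with a few lines of code (an illustrative Python sketch; it assumes, as in the program, that missing values are coded as -99):

```python
MISSING = -99

def patterns(data):
    """Group rows by missing-data pattern; return a list of
    (count, number observed, observed indices, missing indices),
    with 1-based variable indices as in the SAS output."""
    groups = {}
    for row in data:
        obs = tuple(j + 1 for j, v in enumerate(row) if v != MISSING)
        groups[obs] = groups.get(obs, 0) + 1
    p = len(data[0])
    result = []
    for obs, count in sorted(groups.items(), key=lambda kv: -kv[1]):
        mis = tuple(j for j in range(1, p + 1) if j not in obs)
        result.append((count, len(obs), obs, mis))
    return result

# toy data: two complete rows and one row missing variables 2 and 5
toy = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [1, MISSING, 3, 4, MISSING]]
for pat in patterns(toy):
    print(pat)
```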
For a program that generates SEs based on the sandwich-type covariance matrix (e.g.,
EQS, Mplus), one can easily get the SEs of all the parameter estimates of the saturated
model. These SEs correspond to the square roots of the diagonal elements of ΩSWN. But they alone are not enough for obtaining the whole ΩSWN, as mentioned in the introduction section.
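The relation between the sandwich-type covariance matrix and the SEs can be sketched as follows (illustrative Python with made-up 2 × 2 matrices A and B; in the special case B = A the sandwich collapses to the inverse of A, the situation that arises in appendix B):

```python
import math

def inv2(m):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# made-up Hessian-type and score-type matrices, for illustration only
A = [[2.0, 0.5], [0.5, 1.0]]
B = [[3.0, 0.4], [0.4, 2.0]]
N = 88  # sample size

Ainv = inv2(A)
omega = matmul(matmul(Ainv, B), Ainv)                 # sandwich A^{-1} B A^{-1}
ses = [math.sqrt(omega[i][i] / N) for i in range(2)]  # SEs from the diagonal
print(ses)
```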
The output from the program twosML.sas can be used to perform the second-stage analysis by any SEM software that allows the input of ΩSWNs to generate SEs based on the
sandwich-type covariance matrix and rescaled or ADF type statistics. An EQS program for
such an analysis is provided in appendix D, where the matrix ΩEQS = "ΩSWNs" is read into the program from a file⁵ named 'd:\missingdata\homega.dat'. Again, the numbers in the
file are in txt format and separated by space. The sample size is put at N = 88, same as
that for the complete data. Submitting this program to EQS 6 (Bentler, 2008) generates five numbers for each free parameter (represented by a * in the code in appendix D), as in
VECTORS =V2 =   1.290*F1 + 1.000 E2
                 .049
               26.067@
               ( .046)
               ( 28.251@
The number 1.290 is the two-stage MLE λ21, the two numbers immediately below it are
the SE and the associated z-score, using the normal distribution-based information matrix
and treating Σ as a sample covariance matrix. They should be ignored because the SE is
not consistent. The two numbers in parentheses are the SE based on the sandwich-type
⁵The file only contains the (ps + p∗s + 1) × (ps + p∗s + 1) numbers that belong to "ΩSWNs" = ΩEQS from the output of twosML.sas. The p∗s × p∗s numbers corresponding to ΩSWNσs should be saved separately for covariance structure analysis only, as in appendix E.
Table 1. (a) Parameter estimates θ̂, their SEs and z-scores

              complete data               missing data
 θ         θ̂        SE       z         θ̂        SE       z
 τ1     39.263    1.729   22.707    39.187    1.748   22.416
 τ2     46.634    1.588   29.374    46.660    1.585   29.435
 λ21     1.288    0.049   26.531     1.290    0.046   28.251
 λ42     0.910    0.031   28.919     0.882    0.026   33.681
 φ11    86.752   22.266    3.896   102.040   33.993    3.002
 φ21    78.498   16.598    4.729    81.852   23.668    3.458
 φ22   164.396   28.854    5.698   200.402   54.635    3.668
 ψ11   191.417   26.350    7.264   182.097   29.874    6.095
 ψ22    30.989   20.381    1.520    30.703   25.392    1.209
 ψ33    57.904   24.804    2.334    19.499   51.742    0.377
 ψ44   147.017   23.294    6.311   199.950   38.821    5.151
(b) Statistics for overall model evaluation

         complete data             missing data
       TRML   TCRADF   TRF      TRML   TCRADF   TRF
 T     3.055   2.461  0.825     1.361   1.285  0.425
 p     0.383   0.482  0.484     0.715   0.733  0.736
covariance matrix and the associated z-score. These are consistent and should be used when
inferring the significance of λ21. Table 1(a) contains θ, the consistent SEs and the associated
z-scores for both the complete data and missing data. The SEs and z-scores for complete data
are also based on the sandwich-type covariance matrix to facilitate the comparison. Most of
the parameter estimates under missing data are comparable to those under complete data, due to their consistency. Some, however, show substantial differences. For example, ψ33 under complete data is statistically significant at the .05 level as judged by its z-score, but it is not significant under missing data. The smaller z-score is due to a smaller estimate together with a greater SE. Actually, most of the SEs under missing data are considerably greater than those under complete data, due to the loss of information caused by missing data. Note that, although the variable y3 = Analysis does not contain missing values, ψ33 is still strongly affected by the missing data in the other variables.
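As a quick arithmetic check on Table 1(a) (illustrative Python), a z-score is just the estimate divided by its SE; for ψ33 this reproduces the reported values 2.334 and 0.377:

```python
# (estimate, SE) pairs for psi_33, copied from Table 1(a)
complete = (57.904, 24.804)
missing = (19.499, 51.742)

for label, (est, se) in [("complete", complete), ("missing", missing)]:
    print(label, round(est / se, 3))  # z-score = estimate / SE
```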
Submitting this program to EQS 6.1 also generates six statistics for the overall model
evaluation. The three that we recommend are the rescaled statistic TRML, the residual-
based corrected ADF statistic TCRADF , and the residual-based F -statistic TRF . These appear
in EQS output respectively as
SATORRA-BENTLER SCALED CHI-SQUARE = 1.3605 ON 3 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS .71482

YUAN-BENTLER RESIDUAL-BASED TEST STATISTIC = 1.285
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS .73266

YUAN-BENTLER RESIDUAL-BASED F-STATISTIC = .425
DEGREES OF FREEDOM = 3, 85
PROBABILITY VALUE FOR THE F-STATISTIC IS .73579
The statistic TRML does not approach the nominal chi-square distribution in general; instead, it approaches the distribution of a linear combination of chi-square variates that has the same expected value as the nominal chi-square. Many simulation results with
complete data indicate that TRML performs quite well at finite samples (e.g., Hu, Bentler
& Kano, 1992). The statistic TCRADF asymptotically follows a chi-square distribution and
performs quite well with complete data at finite sample sizes (Bentler & Yuan, 1999; Yuan
& Bentler, 1998). The statistic TRF also possesses the property of ADF and performs well
at small sample sizes with complete data (Bentler & Yuan, 1999; Yuan & Bentler, 1998).
We expect these statistics will perform similarly with missing data for two reasons: (1) the
asymptotic properties of the three statistics also hold for the two-stage ML with missing data,
that is, TRML approaches a distribution with the same expected value as that of the nominal
chi-square; TCRADF and TRF are asymptotically distribution free; (2) as the percentage of
missing values approaches zero the three statistics automatically become their complete-data
counterparts.
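The claim that the limiting distribution is a linear combination of chi-squares can be illustrated by simulation (Python sketch with made-up weights): the sum has mean equal to the sum of the weights, so dividing by the average weight, analogous to what the rescaling does with estimated quantities, restores the mean of the nominal chi-square.

```python
import random

random.seed(1)
weights = [2.0, 1.0, 0.5]   # made-up eigenvalue-type weights, one per df
df = len(weights)
reps = 20000

total = 0.0
for _ in range(reps):
    # one draw from sum_i lambda_i * chi^2_1
    total += sum(w * random.gauss(0.0, 1.0) ** 2 for w in weights)

mean_t = total / reps
c = sum(weights) / df        # average weight, the scaling constant
print(round(mean_t, 2), round(mean_t / c, 2))
```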
The three statistics for complete data and missing data are reported in Table 1(b). Below
each statistic is the associated p-value, obtained by referring TRML or TCRADF to χ²3 and TRF to F3,85, the F-distribution with p∗s + ps − q and N − (p∗s + ps − q) degrees of freedom. Like the statistic
TML for complete data, the three statistics in Table 1(b) also suggest that the model in (18)
fits the data well. None of the statistics in Table 1(b) under complete data assumes a known population distribution; their p-values are slightly greater than the one corresponding to TML.
Under missing data, the statistics in Table 1(b) also suggest the model fits the sample very
well. Actually, their p-values are greater than those under the complete data.
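The p-values in Table 1(b) can be recomputed from the reported statistics; for the 3 degrees of freedom here, the chi-square survival function has a closed form (Python sketch using only the standard library):

```python
import math

def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 df:
    P(X > x) = 2[1 - Phi(sqrt(x))] + sqrt(2x/pi) exp(-x/2)."""
    s = math.sqrt(x)
    normal_cdf = 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))
    return 2.0 * (1.0 - normal_cdf) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# rescaled statistic under missing data, from the EQS output
print(round(chi2_sf_df3(1.3605), 3))
```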
Although the numbers reported in Table 1 under missing data are all asymptotically
valid, some differ sizeably from those for complete data, due to sampling errors at finite sample sizes.
Appendix E contains the EQS code for covariance structure analysis only. The matrix ΩSWNσs is read in from an external file. Because the structure of the code and the output
are similar to that in appendix D, we will not discuss the details. Interested readers are
referred to Bentler (2008).
In practice, a data set may contain variables xj and xk that have never been simultaneously observed for any participant. Then the parameter σjk is not estimable and the AN
in (13) is literally singular. In such a case, the SAS program will be unable to generate the
desired Σs or ΩSWNs.
5. Discussion and Conclusion
In social and behavioral sciences, data sets from normal distributions are rare (Micceri,
1989). For a sample with missing values that are MCAR or MAR, if its population distribution is known, the ML based on the true distribution is the preferred method of analysis. In practice, however, we rarely know the population distribution.
In such a situation, most researchers will choose the ML based on the normal distribution
assumption for analysis. The aim of the paper is to redirect such a practice to a statistically
sound two-stage ML. Under the condition described in section 3, the normal distribution-
based MLEs of the means and covariances are still consistent. The asymptotic covariance
matrix of the MLEs can be consistently estimated. At the end of the first stage, the problem
of mean and covariance structure analysis with missing data becomes the same as that for
complete data.
Although it has been suggested that one should include as many auxiliary variables as
possible, including too many may create problems if they are collinear among themselves or with the substantive variables. Then the matrix AN will be near singular and ΩSWN will be inflated.
If possible, one should selectively choose auxiliary variables that are most relevant or closely
related to those that are going to be analyzed at stage-2.
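A toy numeric illustration (Python, made-up numbers) of this inflation: as the off-diagonal of a 2 × 2 A approaches 1, A becomes near singular and the diagonal of the sandwich A⁻¹BA⁻¹ blows up.

```python
def sandwich_diag(r):
    """Diagonal of A^{-1} B A^{-1} with A = [[1, r], [r, 1]] and B = I."""
    det = 1.0 - r * r
    a, b = 1.0 / det, -r / det          # A^{-1} = [[a, b], [b, a]]
    # with B = I the sandwich reduces to A^{-1} A^{-1}
    return a * a + b * b, a * a + b * b

for r in (0.2, 0.9, 0.99):
    print(r, round(sandwich_diag(r)[0], 1))
```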
Enders and Peugh (2004) studied a two-stage ML for missing data. They only focused
on adjusting the sample size required at stage-2, not paying attention to alternative SEs
or test statistics. None of their suggested procedures leads to valid inference even when
data are normally distributed. Savalei and Bentler (2007) studied the two-stage ML with
missing data and a normally distributed population, using ΩSWN = AN^{-1}. They found that the
rescaled statistic and a residual-based generalized least squares (GLS) statistic performed
well. Because the GLS statistic is justified under the normal distribution assumption, its good performance may not carry over when the population distribution is not normal. With a nonnormal population distribution, the sandwich-type covariance
matrix given in (16) or (17) has to be used for consistent SEs and valid statistics for overall
model evaluation.
We want to note that the MAR mechanism described in section 3 is through selections
based on the linear combinations of the observed variables falling into certain intervals.
An MAR mechanism can also be created by other selection processes (e.g., Schafer, 1997, p. 25). Although we suspect that the result in (13) also holds for other selection processes that create an MAR mechanism, this needs to be proved or empirically studied before claiming that
(13) holds for all MAR schemes. We also want to note that model (12) does not account
for outliers or data contamination. Any normal distribution-based procedures are no longer
reliable when data are contaminated, and the effect of data contamination can be much
worse than missing data with MNAR mechanism. When data are contaminated or contain
outliers, one may estimate µ and Σ using the multivariate t-distribution in the first stage
(Little, 1988). With (13) and a consistent ΩSWN , the second-stage analysis is the same as
that using the normal distribution assumption given in this paper.
The literature on mean and covariance structure analysis with missing data often emphasizes the merit of the direct ML or the so-called full information ML. As shown in Yuan and Bentler (2000), the direct ML does not enjoy any better properties than the two-stage ML unless the population is normally distributed. For nonnormally distributed populations,
SEs from the direct ML are not consistent; the LR statistic does not approach a chi-square
distribution. With auxiliary variables, Graham (2003) proposed to let auxiliary variables be
correlated with all measurement errors in the direct ML. But it is not clear how this will
affect the structural parameter estimates; it is not convenient to implement it either (Savalei
& Bentler, 2007).
Multiple imputation (MI) has been recommended for missing data when the distribution
of the population can be specified (Schafer & Graham, 2002). If the distribution of the
population is not multivariate normal and the imputed values are generated from a normal
distribution, then each sample contains a mixture of values from two distributions. One is then unlikely to obtain valid inference when submitting the samples to a SEM program for complete data. For samples from an unknown population distribution, Enders (2002, 2005) gives a
bootstrap procedure to test the overall model fit. Unfortunately, the transformation in equa-
tion (4) of Enders (2002) or equation (2) of Enders (2005) does not satisfy the requirement
for bootstrap testing of the null hypothesis. When the sample size is not large enough,
the bootstrap by resampling from the original observations may provide more accurate SEs
than those based on asymptotics. But the bootstrap may suffer from nonconvergence, especially with missing data. Then the SEs based on just the converged samples will not be
reliable. So, if not explicitly modeling the missing data mechanism, the procedure developed
in this paper might be the best for SEM with missing data in practice where the population
distribution is typically unknown.
Although the asymptotic properties of the statistics TRML, TCRADF and TRF also hold for
the two-stage ML with missing data, and there is no reason for them to behave differently
with missing data, further study about their finite sample behavior is still valuable (e.g., Savalei & Bentler, 2007), especially when the percentage of missing data is large. For statistics
that do not perform well with complete data, it is hard to imagine that they will perform
well with missing data. Any such study should focus on statistics that perform well with
complete data (Hu, et al., 1992; Bentler & Yuan, 1999).
Acknowledgement: We are thankful to the editor and two referees for their constructive
comments that have led to a significant improvement of the paper over the previous version.
Appendix A

In this appendix we will show that $\sum_{j=1}^{n} g(z_{j1}, z_{j2})$ and $\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})$ have the same distribution. Because a distribution and its characteristic function are uniquely determined by each other, we only need to show that they have identical characteristic functions.

Let the characteristic function of $g(z_{j1*}, z_{j2*})$ be $\varphi_*(t)$. Then, following the definition, the characteristic function of $\sum_{j=1}^{n} g(z_{j1}, z_{j2})$ is
$$
E\{\exp[it\sum_{j=1}^{n} g(z_{j1}, z_{j2})]\}
= E\{E\{\exp[it\sum_{j=1}^{n} g(z_{j1}, z_{j2})]\mid n\}\}
= E[\varphi_*^n(t)]
= \sum_{j=0}^{N} \binom{N}{j}\varphi_*^j(t)F_1^j(c)[1-F_1(c)]^{N-j}
= [1 + F_1(c)\varphi_*(t) - F_1(c)]^N.
$$
The characteristic function of $\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})$ is
$$
E\{\exp[it\sum_{j=1}^{N} w_j g(z_{j1*}, z_{j2*})]\}
= (E\{\exp[it w_1 g(z_{11*}, z_{12*})]\})^N
= (E[E\{\exp[it w_1 g(z_{11*}, z_{12*})]\}\mid w_1])^N
= [1 + F_1(c)E\{\exp[it g(z_{11*}, z_{12*})]\} - F_1(c)]^N
= [1 + F_1(c)\varphi_*(t) - F_1(c)]^N.
$$
Thus, the two characteristic functions are the same.

The proof that $\sum_{j=n+1}^{N} g(z_{j1}, z_{j2})$ and $\sum_{j=1}^{N}(1-w_j) g(z_{j1}^*, z_{j2}^*)$ have the same distribution is essentially the same, after noticing that $u_j = 1 - w_j$ also follows a Bernoulli distribution.
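The equal-distribution result can also be checked by simulation (an illustrative Python sketch with g(z1, z2) = z1, c = 0 and standard normal z1, so that F1(c) = 0.5; truncated draws are generated by rejection):

```python
import random

random.seed(2)
N, reps, c = 10, 20000, 0.0

def sum_selected():
    # sum of z1 over the observations with z1 <= c among N iid N(0,1) draws
    return sum(z for z in (random.gauss(0.0, 1.0) for _ in range(N)) if z <= c)

def sum_weighted():
    # sum of w_j * z1j*, with w_j ~ Bernoulli(F1(c)) independent of z1j*
    total = 0.0
    for _ in range(N):
        if random.random() < 0.5:        # F1(0) = 0.5
            z = random.gauss(0.0, 1.0)
            while z > c:                 # rejection: z1* is z1 given z1 <= c
                z = random.gauss(0.0, 1.0)
            total += z
    return total

m1 = sum(sum_selected() for _ in range(reps)) / reps
m2 = sum(sum_weighted() for _ in range(reps)) / reps
print(round(m1, 2), round(m2, 2))   # both near N*F1(0)*E(z1 | z1 <= 0)
```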
Appendix B
This appendix provides the details for the consistency of the sandwich-type covariance matrix $\hat\Omega_{SWN} = \hat A_N^{-1}\hat B_N\hat A_N^{-1}$. We will show that $\hat A_N$ converges to a matrix $A$, $\hat B_N$ converges to a matrix $B$, and the $\Omega$ in (10) equals $A^{-1}BA^{-1}$. Because $\hat A_N$ and $\hat B_N$ are defined by the derivatives of the log likelihood function, we need to obtain $\dot l_i(\mu)$ and $\ddot l_i(\mu)$ analytically. It follows from (2) that
$$
\dot l_i(\mu) = \Sigma^{-1}(x_i - \mu), \quad -\ddot l_i(\mu) = \Sigma^{-1}, \quad i = 1, 2, \ldots, n;
$$
$$
\dot l_i(\mu) = \begin{pmatrix} (x_{i1} - \mu_1)/\sigma_{11} \\ 0 \end{pmatrix}, \quad
-\ddot l_i(\mu) = \begin{pmatrix} 1/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}, \quad i = n+1, n+2, \ldots, N. \quad (A1)
$$
According to (11),
$$
A_N = \frac{n}{N}\Sigma^{-1} + \begin{pmatrix} (N-n)/(N\sigma_{11}) & 0 \\ 0 & 0 \end{pmatrix}.
$$
Because $A_N$ does not involve $\hat\mu$, $\hat A_N = A_N$. Let
$$
A = F_1(c)\Sigma^{-1} + \begin{pmatrix} [1 - F_1(c)]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A2)
$$
It follows from $n = \sum_{i=1}^{N} w_i \sim B(N, F_1(c))$ that $n/N \xrightarrow{P} F_1(c)$. Thus, $\hat A_N \xrightarrow{P} A$.
Before turning to the probability limit of $\hat B_N$, we need to note that the moments of the random variables $z_{1*}$ and $z_1^*$ introduced in section 2 are closely related to those of $z_1$. Actually, it follows from the definition of conditional probability $P(z_1 \le t \mid w = 1) = P(z_1 \le t \mid z_1 \le c)$ that the CDF of $z_{1*}$ is
$$
F_{1*}(t) = \begin{cases} F_1(t)/F_1(c), & t \le c, \\ 1, & t > c. \end{cases} \quad (A3)
$$
Similarly, the CDF of $z_1^*$ is
$$
F_1^*(t) = \begin{cases} 0, & t \le c, \\ [F_1(t) - F_1(c)]/[1 - F_1(c)], & t > c. \end{cases} \quad (A4)
$$
It follows from (A3) and (A4) that $E(z_{1*})$, $E(z_1^*)$, $E(z_{1*}^2)$ and $E(z_1^{*2})$ are all well defined, and
$$
F_1(c)E(z_{1*}^2) + [1 - F_1(c)]E(z_1^{*2}) = E(z_1^2) = 1. \quad (A5)
$$
Now we are ready to consider the convergence of $\hat B_N$. It follows from (11) and (A1) that
$$
B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\mu)(x_i-\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix} \quad (A6)
$$
and
$$
\hat B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\hat\mu)(x_i-\hat\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}. \quad (A7)
$$
Let
$$
B = F_1(c)L'^{-1}\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}L^{-1}
+ \frac{1-F_1(c)}{\sigma_{11}}\begin{pmatrix} E(z_1^{*2}) & 0 \\ 0 & 0 \end{pmatrix}. \quad (A8)
$$
In the following we will first show that $B_N \xrightarrow{P} B$; next we will show that $\hat B_N$ has the same probability limit as $B_N$ by showing that $\hat B_N - B_N \xrightarrow{P} 0$; then we will see that the matrix $A^{-1}BA^{-1}$ is just the $\Omega$ given in (10).

We will need to relate the $x_i$ and $x_{i1}$ in (A6) to $z_{i1}$ and $z_{i2}$ through (4) to facilitate obtaining the probability limit of $B_N$. Using $x_i - \mu = Lz_i$, $i = 1, 2, \ldots, n$, and $x_{i1} - \mu_1 = \sigma_1 z_{i1}$, $i = n+1, n+2, \ldots, N$, we can rewrite (A6) as
$$
B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}Lz_iz_i'L'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} z_{i1}^2/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}
$$
$$
= \Sigma^{-1}L\left[\frac{1}{N}\sum_{i=1}^{N}w_i
\begin{pmatrix} z_{i1*}^2 & z_{i1*}z_{i2*} \\ z_{i1*}z_{i2*} & z_{i2*}^2 \end{pmatrix}\right]L'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\begin{pmatrix} z_{i1}^{*2}/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A9)
$$
Recall that $\Sigma^{-1} = L'^{-1}L^{-1}$, $E(z_{i2*}) = 0$, $E(z_{i2*}^2) = 1$; $z_{i1*}$, $z_{i2*}$ and $w_i$ are independent; and $z_{i1}^*$ and $w_i$ are independent. Applying the law of large numbers to (A9) yields
$$
B_N \xrightarrow{P} B. \quad (A10)
$$
Turning to $\hat B_N$, we can rewrite (A7) as
$$
\hat B_N = \frac{1}{N}\sum_{i=1}^{n}\Sigma^{-1}(x_i-\mu+\mu-\hat\mu)(x_i-\mu+\mu-\hat\mu)'\Sigma^{-1}
+ \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1+\mu_1-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}
$$
$$
= B_N + B_{N12}^{(1)} + B_{N21}^{(1)} + B_{N22}^{(1)} + 2B_{N12}^{(2)} + B_{N22}^{(2)}, \quad (A11)
$$
where
$$
B_{N12}^{(1)} = \Sigma^{-1}\left[\frac{1}{N}\sum_{i=1}^{n}(x_i-\mu)\right](\mu-\hat\mu)'\Sigma^{-1}, \quad
B_{N21}^{(1)} = B_{N12}^{(1)\prime},
$$
$$
B_{N22}^{(1)} = \left(\frac{1}{N}\sum_{i=1}^{n}1\right)\Sigma^{-1}(\mu-\hat\mu)(\mu-\hat\mu)'\Sigma^{-1},
$$
$$
B_{N12}^{(2)} = \frac{1}{N}\sum_{i=n+1}^{N}\begin{pmatrix} (x_{i1}-\mu_1)(\mu_1-\hat\mu_1)/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix},
$$
$$
B_{N22}^{(2)} = \left(\frac{1}{N}\sum_{i=n+1}^{N}1\right)\begin{pmatrix} (\mu_1-\hat\mu_1)^2/\sigma_{11}^2 & 0 \\ 0 & 0 \end{pmatrix}.
$$
We will show that $B_{N12}^{(1)}$, $B_{N21}^{(1)}$, $B_{N22}^{(1)}$, $B_{N12}^{(2)}$ and $B_{N22}^{(2)}$ all approach zero in probability. Notice that $\hat\mu \xrightarrow{P} \mu$ was obtained in section 2. It follows from
$$
\frac{1}{N}\sum_{i=1}^{n}(x_i-\mu) = \frac{1}{N}\sum_{i=1}^{N}w_iLz_{i*} \xrightarrow{P} F_1(c)LE(z_*)
$$
that
$$
B_{N12}^{(1)} = B_{N21}^{(1)\prime} \xrightarrow{P} 0; \quad (A12)
$$
it follows from
$$
\frac{1}{N}\sum_{i=1}^{n}1 = \frac{1}{N}\sum_{i=1}^{N}w_i \xrightarrow{P} F_1(c)
$$
that
$$
B_{N22}^{(1)} \xrightarrow{P} 0; \quad (A13)
$$
it follows from
$$
\frac{1}{N}\sum_{i=n+1}^{N}(x_{i1}-\mu_1) = \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\sigma_1z_{i1}^* \xrightarrow{P} [1-F_1(c)]\sigma_1E(z_1^*)
$$
and
$$
\frac{1}{N}\sum_{i=n+1}^{N}1 = \frac{1}{N}\sum_{i=1}^{N}(1-w_i) \xrightarrow{P} 1-F_1(c)
$$
that
$$
B_{N12}^{(2)} \xrightarrow{P} 0 \quad\text{and}\quad B_{N22}^{(2)} \xrightarrow{P} 0. \quad (A14)
$$
Combining (A10) to (A14) yields
$$
\hat B_N \xrightarrow{P} B.
$$
We still need to show that $A^{-1}BA^{-1}$ equals the $\Omega$ in (10) for the consistency of $\hat\Omega_{SWN}$. We will achieve this by working with $B$ and showing that $B = A$ first; then we will show that $\Omega = A^{-1}$. Let $d = |L| = \sigma_1\sigma_2(1-\rho^2)^{1/2}$; then
$$
L^{-1} = d^{-1}\begin{pmatrix} \sigma_2(1-\rho^2)^{1/2} & 0 \\ -\sigma_2\rho & \sigma_1 \end{pmatrix}.
$$
It follows from
$$
\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
+ \begin{pmatrix} E(z_{1*}^2)-1 & 0 \\ 0 & 0 \end{pmatrix}
$$
that
$$
L'^{-1}\begin{pmatrix} E(z_{1*}^2) & 0 \\ 0 & 1 \end{pmatrix}L^{-1}
= \Sigma^{-1} + \begin{pmatrix} [E(z_{1*}^2)-1]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}. \quad (A15)
$$
Using (A5) and combining (A8) and (A15) yield
$$
B = F_1(c)\Sigma^{-1} + \begin{pmatrix} [1-F_1(c)]/\sigma_{11} & 0 \\ 0 & 0 \end{pmatrix}.
$$
Thus, $B = A$ and
$$
\hat\Omega_{SWN} \xrightarrow{P} \Omega_{SW} = A^{-1}.
$$
We only need to show $A^{-1} = \Omega$ in order for $\hat\Omega_{SWN} \xrightarrow{P} \Omega$. It follows from
$$
\Sigma^{-1} = \frac{1}{d^2}\begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix}
$$
that we can write the $A$ in (A2) as
$$
A = \begin{pmatrix}
F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11} & -F_1(c)\sigma_{12}/d^2 \\
-F_1(c)\sigma_{12}/d^2 & F_1(c)\sigma_{11}/d^2
\end{pmatrix}.
$$
Let $h^2 = |A|$; then
$$
h^2 = \{F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11}\}F_1(c)\sigma_{11}/d^2 - F_1^2(c)\sigma_{12}^2/d^4
= \frac{F_1^2(c)}{d^4}(\sigma_{11}\sigma_{22}-\sigma_{12}^2) + \frac{[1-F_1(c)]F_1(c)}{d^2}
= \frac{F_1^2(c) + F_1(c)[1-F_1(c)]}{d^2}
= \frac{F_1(c)}{d^2}.
$$
Consequently,
$$
A^{-1} = \frac{1}{h^2}\begin{pmatrix}
F_1(c)\sigma_{11}/d^2 & F_1(c)\sigma_{12}/d^2 \\
F_1(c)\sigma_{12}/d^2 & F_1(c)\sigma_{22}/d^2 + [1-F_1(c)]/\sigma_{11}
\end{pmatrix}
= \begin{pmatrix}
\sigma_{11} & \sigma_{12} \\
\sigma_{12} & \sigma_{22}\{1 + [1/F_1(c)-1](1-\rho^2)\}
\end{pmatrix}. \quad (A16)
$$
Comparing (A16) with (10), we have $A^{-1} = \Omega$. Thus, $\hat\Omega_{SWN}$ is consistent for $\Omega$.
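The closed form (A16) can be confirmed numerically by inverting A directly (illustrative Python with arbitrary made-up values of σ1, σ2, ρ and F1(c)):

```python
def a_inverse(s1, s2, rho, F):
    """Invert A = F * Sigma^{-1} + diag((1 - F)/sigma11, 0) for the
    bivariate case treated in this appendix."""
    s11, s22, s12 = s1 * s1, s2 * s2, rho * s1 * s2
    d2 = s11 * s22 - s12 * s12          # d^2 = |Sigma|
    a11 = F * s22 / d2 + (1.0 - F) / s11
    a12 = -F * s12 / d2
    a22 = F * s11 / d2
    det = a11 * a22 - a12 * a12         # this is h^2 = F1(c)/d^2
    return [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]

s1, s2, rho, F = 1.3, 0.7, 0.4, 0.6     # arbitrary made-up values
inv = a_inverse(s1, s2, rho, F)
# closed form (A16): sigma11, sigma12, and an inflated sigma22
target22 = s2 * s2 * (1.0 + (1.0 / F - 1.0) * (1.0 - rho * rho))
print(inv[0][0], inv[0][1], inv[1][1], target22)
```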
Appendix C
data raw;
filename data ’d:\missingdata\mardiaMV25.dat’; *need to be modified;
infile data;
input v1 v2 v3 v4 v5; *need to be modified;
run;
*-------------------------------------------------------------*;
use raw;
read all var _num_ into x;
close raw;
n=nrow(x);
p=ncol(x);
V_forana={1, 2, 4, 5}; *need to be specified;
p_v=nrow(V_forana);
pvs=p_v+p_v*(p_v+1)/2;
run pattern(n,p,x,misinfo);
totpat=nrow(misinfo);
print "#total observed patterns=" totpat;
print "cases--#observed V--observed V--missing V=";
print misinfo;
run emmus(n,p,x,misinfo,hmu1,hsigma1);
hmu_s=hmu1[V_forana]‘;
hsigma_s=hsigma1[V_forana,V_forana];
print "hat\mu_s=";
print hmu_s;
print "hat\Sigma_s=";
print hsigma_s;
run Omega(n,p,hmu1, hsigma1,x,misinfo, omega_sw);
run indexv(p,V_forana,index_s); *index for both means and covariances;
run indexvc(p,V_forana,index_sc); *index for only the covariances;
run switch(p_v, permuc, permu); *generating permutation matrices;
homega_swc=omega_sw[index_sc,index_sc];
homega_swc=permuc*homega_swc*permuc‘;*needed for the 2nd stage ML in EQS;
print "hat\Omega_sw\sigma_s=";
print homega_swc;
homega_sw=omega_sw[index_s,index_s];
homega_sw=( permu*homega_sw*permu‘||j(pvs,1,0) )//(j(1,pvs,0)||1);
*needed for the 2nd stage ML in EQS;
print "hat\Omega_sw_s=";
print homega_sw;
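The index bookkeeping performed by the indexvc module above can be mimicked as follows (an illustrative Python sketch, not the paper's IML code): given p variables and the subset for the second-stage analysis, it returns the positions of vech(Σ) that belong to the subset's submatrix.

```python
def vech_indices(p, subset):
    """0-based positions in vech(Sigma) (columns of the lower triangle
    stacked) whose (row, col) pair lies entirely within `subset`
    (1-based variable numbers, as in V_forana)."""
    keep = set(subset)
    pairs = [(i + 1, j + 1) for j in range(p) for i in range(j, p)]
    return [k for k, (i, j) in enumerate(pairs) if i in keep and j in keep]

# variables 1, 2, 4, 5 out of p = 5, as in the example
idx = vech_indices(5, [1, 2, 4, 5])
print(idx, len(idx))
```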
Appendix D
/TITLE
EQS 6.1: Mean and covariance structure analysis
using the output of twosML.sas
/SPECIFICATION
weight=’d:\missingdata\homega.dat’;
cases=88; variables=4; matrix=covariance;
analysis=moment; methods=ML, robust;
/LABELS
V1=Mechanics; V2=Vectors; V3=Analysis; V4=Statistics;
/EQUATIONS
V1= F1+E1;
V2= *F1+E2;
V3= F2+E3;
V4= *F2+E4;
F1= *V999+D1;
F2= *V999+D2;
/VARIANCES
E1-E4= *;
D1=*;
D2=*;
/COVARIANCES
D1,D2= *;
/TECHNICAL
conv=0.00000001;
/Means
38.954545 50.58184 46.681818 40.912445
/MATRIX
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
/END
Appendix E
/TITLE
EQS 6.1: Covariance structure analysis using the output of twosML.sas
/SPECIFICATION
weight=’d:\missingdata\homegac.dat’;
cases=88; variables=4; matrix=covariance;
analysis=covariance; methods=ML, robust;
/LABELS
V1=Mechanics; V2=Vectors; V3=Analysis; V4=Statistics;
/EQUATIONS
V1= F1+E1;
V2= *F1+E2;
V3= F2+E3;
V4= *F2+E4;
/VARIANCES
E1-E4= *;
F1=*;
F2=*;
/COVARIANCES
F1,F2= *;
/TECHNICAL
conv=0.00000001;
/MATRIX
302.29339 142.19539 105.06508 106.37877
142.19539 197.43999 100.4684 112.94803
105.06508 100.4684 217.87603 187.67792
106.37877 112.94803 187.67792 376.62339
/END
References
Amemiya, T. (1973). Regression analysis when the dependent variable is truncated normal.
Econometrica, 41, 997–1016.
Anderson, T. W. (1957). Maximum likelihood estimates for the multivariate normal distribu-
tion when some observations are missing. Journal of the American Statistical Association,
52, 200–203.
Arminger, G., & Sobel, M. E. (1990). Pseudo-maximum likelihood estimation of mean and
covariance structures with missing data. Journal of the American Statistical Association,
85, 195–203.
Bentler, P. M. (2008). EQS 6 structural equations program manual. Encino, CA: Multivariate
Software.
Bentler, P. M., & Yuan, K.-H. (1999). Structural equation modeling with small samples:
Test statistics. Multivariate Behavioral Research, 34, 181–197.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Browne, M. W. (1984). Asymptotic distribution-free methods for the analysis of covariance
structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
Collins, L. M., Schafer, J. L., & Kam, C. K. (2001). A comparison of inclusive and restrictive
strategies in modern missing-data procedures. Psychological Methods, 6, 330–351.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from
incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical
Society B, 39, 1–38.
Enders, C. K. (2002). Applying the Bollen-Stine bootstrap for goodness-of-fit measures
to structural equation models with missing data. Multivariate Behavioral Research, 37,
359–377.
Enders, C. K. (2005). A SAS macro for implementing the modified Bollen-Stine bootstrap
for missing data: Implementing the bootstrap using existing structural equation modeling
software. Structural Equation Modeling, 12, 620–641.
Enders, C. K., & Peugh, J. L. (2004). Using an EM covariance matrix to estimate structural
equation models with missing data: Choosing an adjusted sample size to improve the
accuracy of inferences. Structural Equation Modeling, 11, 1–19.
Ferguson, T. S. (1996). A course in large sample theory. London: Chapman & Hall.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods:
Theory. Econometrica, 52, 681–700.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural
equation models. Structural Equation Modeling, 10, 80–100.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47,
153–161.
Hoffman, P. J. (1959). Generating variables with arbitrary properties. Psychometrika, 24,
265–267.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis
be trusted? Psychological Bulletin, 112, 351–362.
Jamshidian, M., & Bentler, P. M. (1999). Using complete data routines for ML estimation of
mean and covariance structures with missing data. Journal of Educational and Behavioral
Statistics, 23, 21–41.
Laird, N. M. (1988). Missing data in longitudinal studies. Statistics in Medicine, 7, 305–315.
Lee, W.-C., & Rodgers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Little, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data
with missing values. Applied Statistics, 37, 23–38.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New
York: Wiley.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications.
Biometrika, 57, 519–530.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. New York:
Academic Press.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psy-
chological Bulletin, 105, 156–166.
Rotnitzky, A., & Wypij, D. (1994). A note on the bias of estimators with missing data.
Biometrics, 50, 1163–1170.
Rubin, D. B. (1976). Inference and missing data (with discussions). Biometrika, 63, 581–592.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in
covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent Variables
Analysis: Applications for Developmental Research (pp. 399–419). Newbury Park, CA:
Sage.
Savalei, V., & Bentler, P. M. (2007). A two-stage ML approach to missing data: Theory
and application to auxiliary variables. UCLA Statistics Electronic Publications #511.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7, 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data
problems: A data analyst’s perspective. Multivariate Behavioral Research, 33, 545–571.
Tanaka, Y., Watadani, S., & Moon, S. H. (1991). Influence in covariance structure analysis:
With an application to confirmatory factor analysis. Communications in Statistics - Theory and Methods, 20, 3805–3821.
Tobin, J. (1958). Estimation for relationships with limited dependent variables. Economet-
rica, 26, 24–36.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
50, 1–25.
Yuan, K.-H. (2007). Normal theory ML for missing data with violation of distribution
assumptions. Under review.
Yuan, K.-H., & Bentler, P. M. (1997). Mean and covariance structure analysis: Theoretical
and practical improvements. Journal of the American Statistical Association, 92, 767–774.
Yuan, K.-H., & Bentler, P. M. (1998). Normal theory based test statistics in structural
equation modeling. British Journal of Mathematical and Statistical Psychology, 51, 289–
309.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and co-
variance structure analysis with nonnormal missing data. In M. E. Sobel & M. P. Becker
(Eds.), Sociological methodology 2000 (pp. 167–202). Oxford: Blackwell.
Yuan, K.-H., & Bentler, P. M. (2001). A unified approach to multigroup structural equation
modeling with nonstandard samples. In G. A. Marcoulides & R. E. Schumacker (Eds.),
Advanced structural equation modeling: New developments and techniques (pp. 35–56).
Mahwah, NJ: Lawrence Erlbaum Associates.
Yuan, K.-H., Marshall, L. L., & Bentler, P. M. (2002). A unified approach to exploratory
factor analysis with missing data, nonnormal data, and in the presence of outliers. Psy-
chometrika, 67, 95–122.