
A Toy Problem of Missing Data and A Comprehensive Study∗

Peng Wang

Abstract

Given the settings of a hypothetical data frame with a customizable missing-data mechanism, I explored various methods that are typically used in missing-data analysis, compared their relative performance given the model and the parameter pre-settings, and discussed their strengths and weaknesses as well as the potential problems in real-life applications. It serves both as a complementary study of the class material and as an exercise in missing data analysis.

keywords: Missing Data, Imputation, Multiple Imputation, Bootstrap, EM Algorithm, Metropolis-Hastings Algorithm, Bayesian Inference, Data Augmentation.

∗Email: [email protected]


Contents

1 A Hypothetical Model of Missing Data and the Objective

2 A Notationally-Complete Definition for Missing-Data Modelling

3 Misconceptions about Missing Data and Imputation
  3.1 Something Out of Nothing?
  3.2 Imputation or Manipulation?

4 Single Imputation Methods by Regression
  4.1 Conditional Mean Imputation
  4.2 Predictive Mean Imputation
  4.3 Conditional Draw Imputation

5 The Maximum Likelihood Approach
  5.1 Maximum Likelihood Approach for Ignorable Likelihood
  5.2 Maximum Likelihood Approach for Non-ignorable Likelihood

6 The EM Algorithm
  6.1 EM Algorithm for Ignorable Likelihood
  6.2 EM Algorithm for Non-ignorable Likelihood

7 The Bayesian Approach
  7.1 Bayesian by Metropolis-Hastings Algorithm for Ignorable Likelihood
  7.2 Bayesian by Data Augmentation for Ignorable Likelihood
  7.3 Bayesian Approach for Non-ignorable Likelihood

8 Multiple Imputation for Ignorable Likelihood
  8.1 Multiple Imputation in this Problem
  8.2 A Semi Multiple Imputation Scheme
  8.3 Multiple Imputation by Data Augmentation

9 Simulation Results and Discussions

10 Remaining Questions
  10.1 Robust Imputation Ways
  10.2 Non-parametric Multiple Imputation by Propensity Score

11 The Methods Adopted by MI Software
  11.1 The PROC MI in SAS
  11.2 The MI package in R

12 References


1 A Hypothetical Model of Missing Data and the Objective

The data frame with an adjustable missing-data mechanism is as follows. The vector Yi = (yi1, yi2, ui), i ∈ {1, · · · , n}, is generated by the following scheme:

yi1 = 1 + zi1

yi2 = 5 + 2zi1 + zi2

ui = a(yi1 − 1) + b(yi2 − 5) + zi3

zi1, zi2, zi3 ∼ N(0, 1) independently.

This can be shown to be equivalent to the multivariate normal model Y_i \sim N(\mu_Y, \Sigma_Y) with

\mu_Y = (1, 5, 0)^T, \qquad
\Sigma_Y = \begin{pmatrix} 1 & 2 & a+2b \\ 2 & 5 & 2a+5b \\ a+2b & 2a+5b & (a+2b)^2 + b^2 + 1 \end{pmatrix}

Among the data, suppose Y1 is fully observed, Y2 is partially observed, and U is unobserved, based on the following scheme:

yi2 is missing if ui < 0.

The values of (a, b) can vary to represent arbitrary missing-data mechanisms. The three settings we use are (1) missing completely at random (MCAR), a = 0, b = 0; (2) missing at random (MAR), a = 2, b = 0; and (3) not missing at random (NMAR), a = 0, b = 2.
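As a concrete illustration, the following R sketch generates one such data frame under the scheme above; the function name simulate_frame and the sample size n = 200 are my own choices, not part of the original setup.

```r
# A minimal sketch of the data-generating scheme above; a and b control the
# missing-data mechanism, and m flags the cases where y2 is missing (u < 0).
simulate_frame <- function(n, a, b) {
  z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)
  y1 <- 1 + z1
  y2 <- 5 + 2 * z1 + z2
  u  <- a * (y1 - 1) + b * (y2 - 5) + z3
  m  <- u < 0
  data.frame(y1 = y1, y2 = ifelse(m, NA, y2), u = u, m = m)
}

d_mcar <- simulate_frame(200, a = 0, b = 0)  # MCAR
d_mar  <- simulate_frame(200, a = 2, b = 0)  # MAR
d_nmar <- simulate_frame(200, a = 0, b = 2)  # NMAR
```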

Figure 1 shows what the data look like under the different missing-data mechanisms.


Figure 1: Visualization of the missing data frame under different settings. Left panel is MCAR; middle panel, MAR; right panel, NMAR.

Our objective is to estimate the mean and the covariance matrix of (y1, y2); the possibility of estimating the whole µY and ΣY is also discussed.

2 A Notationally-Complete Definition for Missing-Data Modelling

Even in Little & Rubin's book, the definitions for the different missing-data mechanisms, that is, MCAR, MAR and NMAR, are too abbreviated (page 11), and may cause confusion for readers when they encounter similar arguments in subsequent studies.


Here I enunciate the necessary and complete set of components involved in missing-data modelling, and redefine some of the mechanisms. All similar arguments or modelling processes in missing-data handling can be deemed either abbreviated forms or reduced forms of it (mostly by integration), and no confusion should arise.

To completely model a problem with missing data, we need to define all of the following components:

Yo : all the observed data values;

Ym : all the missing data values;

M : the indicator matrix of missingness, with 1/0 meaning missing/observed;

θ : all the parameters on which the distributions of Yo and Ym depend;

φ : all the parameters on which the distribution of M depends.

Notice that (1) for a given problem, we actually observe both Yo and M, thus both are observed information; (2) for a given problem, except for Yo and θ, not all components are necessarily needed for modelling; (3) φ and θ may not be distinct and may have functional relationships with each other.

Missing Completely at Random (MCAR): the missingness, or the missingness indicator M, does not depend on either the observed values Yo or the missing ones Ym.

f(M |Yo, Ym, θ, φ) = f(M |θ, φ)

Missing at Random (MAR): the missingness, or the missingness indicator M, depends only on the observed values Yo, but not on the missing ones Ym.

f(M |Yo, Ym, θ, φ) = f(M |Yo, θ, φ)

Not Missing at Random (NMAR): the missingness, or the missingness indicator M, depends on both the observed values Yo and the missing ones Ym.

f(M | Yo, Ym, θ, φ) ≠ f(M | Yo, θ, φ)

With the above definition, as an example, we can deduce ignorability in a clearer way from the observed likelihood under the MAR condition as follows:

L_{Obs} = f(Y_o, M \mid \theta, \phi) = \int f(Y_o, Y_m, M \mid \theta, \phi)\, dY_m
        = \int f(M \mid Y_o, Y_m, \theta, \phi)\, f(Y_o, Y_m \mid \theta, \phi)\, dY_m
        = \int f(M \mid Y_o, \phi)\, f(Y_o, Y_m \mid \theta)\, dY_m
        = f(M \mid Y_o, \phi) \int f(Y_o, Y_m \mid \theta)\, dY_m
        = f(M \mid Y_o, \phi)\, f(Y_o \mid \theta)

Besides the MAR condition, we do need φ and θ to be distinct, which means their parameter spaces should be independent and there should not be any functional relationship between any of their elements. We defined the observed likelihood as LObs = f(Yo, M | θ, φ) because Yo and M constitute the entire observed information, and (θ, φ) are the parameters behind them.

The above deduction factors the overall likelihood into two parts, and we can maximize each part separately to solve for the MLEs of φ and θ. Since most of the time θ is our sole interest, we can ignore the f(M | Yo, φ) part and just focus on f(Yo | θ). In this sense, ignorability means, literally, that we are ignoring the M matrix along with the parameter φ behind it.

For ignorability in Bayesian inference, we need φ and θ to be a priori independent, which basically means

p(θ, φ) = p(θ)p(φ)

In most cases when we talk about MCAR or MAR, we assume ignorability or the independence between φ and θ, and in the definitions of MCAR and MAR the θ can then be further dropped in both cases. However, it is good to keep in mind that these assumptions are not necessarily true: even if the data are MCAR or MAR, we are not necessarily entitled to draw conclusions or use methods based on ignorability, although such occurrences are rare.

3 Misconceptions about Missing Data and Imputation

3.1 Something Out of Nothing?

When I first heard about missing-data handling, the first question that popped up was: can we indeed get something out of nothing that is missing? That sounds more like a magician's job than a statistician's. The answer is, of course, NO, if the whole observation is missing, or if the missing properties have nothing to do with the observed ones.

In fact, missing data here should more accurately be referred to as incomplete data, which means we should have at least partially observed an observation. The underlying rationale of handling a data set with such incomplete observations is that the observed properties can potentially tell us something about the unobserved ones in each observation, whereas the traditional analysis would discard the incomplete observations entirely.

Thus we are not trying to get something out of nothing any more: we are using the partially observed information to infer the unobserved properties, which is identical to all other statistical methods in a general sense.

An interesting observation about handling missing data is that, mathematically, there is no difference between completely-missing observations and partially-missing properties in observations. They are all "missing" properties in a model. Thus we can incorporate either or both cases into the statistical model. The result is that the analysis with completely-missing observations degenerates to the analysis with only the observed plus partially-observed observations, which means the completely-missing ones have no effect on the final answer even though you included them in the model (you can try this with the EM algorithm). This still coincides with our intuition: we cannot get anything out of nothing.

The partially observed observations certainly play a role here, since we observed some properties, and in many cases the missing properties are related to these observed ones and we can exploit this. This is the whole point of missing-data analysis.

3.2 Imputation or Manipulation?

A statistician's first intuition about imputation would also be NO. If you did not observe something and you fill it in artificially, then you are manipulating or forging the data. Good statisticians rarely, if I shouldn't say never, manipulate or forge data; that should be deemed misconduct.

The answer to this question, after carefully examining all the subsequent imputation methods, is that if you manipulate or forge the data in a clever and regulated way, you can still benefit from it. This is the rationale for all single imputation methods. But for multiple imputation, if you carry it out in a strict way, it is not manipulation at all, but a subclass of Bayesian analysis.

Take multiple imputation by data augmentation for example. In Bayesian analysis, the posterior distributions of the estimators mostly do not have explicit forms. Thus we augment the typical Bayesian model by including the missing properties to facilitate the modelling, and use methods such as MCMC to take samples from the posteriors, both for the unknown parameters and for the missing values. The multiply imputed missing values can be deemed a numerical way to solve the Bayesian inference problem. And the intuition for multiple rather than single imputation is that we can only reveal the uncertainty carried in the missing values through multiple random draws rather than just one.

4 Single Imputation Methods by Regression

4.1 Conditional Mean Imputation

For conditional mean imputation, the regression line

yi2 = β0 + β1yi1 + εi, εi ∼ N(0, σ2), i = 1, · · · ,m

was estimated only from the complete cases, and the missing values of Y2 were imputed by the regression predictions given the estimated model.

All the following parameter estimates,

\hat\mu = (\hat\mu_1, \hat\mu_2), \qquad \hat\Sigma = \begin{bmatrix} \hat\sigma_{11} & \hat\sigma_{12} \\ \hat\sigma_{12} & \hat\sigma_{22} \end{bmatrix},

are directly computed from the imputed data set as if there were no missing data.

The bootstrap was used to infer the variance of all the above estimates. It was carried out over the entire original data set with missing values, and the triplets (yi1, yi2, ui) were bootstrapped together. For every bootstrap sample, an independent imputation and estimation was carried out in order to get the estimates of the mean and the covariance matrix of (y1, y2).


\hat\mu_1 = \frac{1}{n}\sum_{i=1}^{n} y_{i1}

\hat y_{i2} = \hat\beta_0 + \hat\beta_1 y_{i1} \quad \text{(imputation for the missing cases)}

\hat\mu_2 = \frac{1}{n}\Big(\sum_{\text{complete}} y_{i2} + \sum_{\text{imputed}} \hat y_{i2}\Big)

\widehat{\mathrm{Cov}}(y_{i1}, y_{i2}) = \hat\Sigma = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar y)(y_i - \bar y)^T
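A minimal R sketch of this procedure, assuming the data frame from the Section 1 sketch (columns y1, y2 with NAs, and the missingness flag m); all function names here are my own.

```r
# Conditional mean imputation from the complete-case regression, followed by a
# bootstrap of the whole impute-then-estimate procedure.
impute_cond_mean <- function(d) {
  fit <- lm(y2 ~ y1, data = d[!d$m, ])            # fit on complete cases only
  d$y2[d$m] <- predict(fit, newdata = d[d$m, ])   # fill in conditional means
  d
}

estimate <- function(d) {
  di <- impute_cond_mean(d)
  S  <- cov(di[, c("y1", "y2")])
  c(mu1 = mean(di$y1), mu2 = mean(di$y2),
    sig11 = S[1, 1], sig12 = S[1, 2], sig22 = S[2, 2])
}

boot_se <- function(d, B = 200) {
  reps <- replicate(B, estimate(d[sample(nrow(d), replace = TRUE), ]))
  apply(reps, 1, sd)                              # bootstrap standard errors
}
```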

4.2 Predictive Mean Imputation

This is the conditional mean imputation plus a random residual εi ∼ N(0, σ̂²):

ŷi2 = β̂0 + β̂1 yi1 + εi (imputed values)

But this method is not included in the final simulation analysis.

4.3 Conditional Draw Imputation

For conditional draw imputation using regression, given the regression model

yi2 = β0 + β1yi1 + εi, εi ∼ N(0, σ2), i = 1, · · · ,m

the parameter draws from the Bayesian posteriors under the Jeffreys prior are (from Molenberghs' book, page 112):

\sigma^2 = \frac{(m-2)s^2}{X} \quad \text{for } X \sim \chi^2_{m-2}, \qquad
\vec\beta \sim N\big(\hat\beta,\ \sigma^2 (F'F)^{-1}\big)

with m being the number of complete cases, and F being the m × 2 matrix with rows consisting of (1, yi1) from the complete cases.

Given the parameter draw (β⃗d, σ²d), the missing yi2 are imputed by the regression model, and all the parameter estimates are calculated as if the data were complete, as before. Their variances are estimated by the bootstrap, as before.
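A minimal sketch of one conditional draw imputation under the assumptions above, using the frame from the Section 1 sketch; MASS::mvrnorm is used for the multivariate normal draw, and the function name is mine.

```r
impute_cond_draw <- function(d) {
  cc  <- d[!d$m, ]                                 # complete cases
  fit <- lm(y2 ~ y1, data = cc)
  m   <- nrow(cc)
  s2  <- sum(residuals(fit)^2) / (m - 2)
  sigma2_d <- (m - 2) * s2 / rchisq(1, df = m - 2) # posterior draw of sigma^2
  F <- cbind(1, cc$y1)
  beta_d <- MASS::mvrnorm(1, mu = coef(fit),
                          Sigma = sigma2_d * solve(t(F) %*% F))  # draw of beta
  miss <- which(d$m)
  d$y2[miss] <- beta_d[1] + beta_d[2] * d$y1[miss] +
    rnorm(length(miss), 0, sqrt(sigma2_d))         # conditional draws for y2
  d
}
```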

5 The Maximum Likelihood Approach

5.1 Maximum Likelihood Approach for Ignorable Likelihood

Suppose we know, or it is reasonable to assume, that the data are at least missing at random (MAR), and it is also appropriate to assume that each independent pair yi = (yi1, yi2) has a bivariate normal distribution as follows:

y_i = \begin{bmatrix} y_{i1} \\ y_{i2} \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}\right) \quad \text{for all } i \in \{1, \cdots, n\}.


Given the data, with Y1 = {y11, · · · , yn1} fully observed and only the first r units of Y2 = {y12, · · · , yr2} observed, let θ = {µ1, µ2, σ11, σ12, σ22} be the collection of all parameters involved in the model. Since the missing data are ignorable, the log-likelihood of the observed data is

L_{ign} = \log f(Y_1, Y_2 \mid \theta)
        = \log\left( \prod_{i=1}^{r} \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(y_i-\mu)^T\Sigma^{-1}(y_i-\mu)\right] \prod_{i=r+1}^{n} \frac{1}{\sqrt{2\pi\sigma_{11}}} \exp\left[-\tfrac{1}{2\sigma_{11}}(y_{i1}-\mu_1)^2\right] \right)
        = -\tfrac{1}{2} r \log|\Sigma| - \tfrac{1}{2}\sum_{i=1}^{r}(y_i-\mu)^T\Sigma^{-1}(y_i-\mu) - \tfrac{1}{2}(n-r)\log\sigma_{11} - \tfrac{1}{2\sigma_{11}}\sum_{i=r+1}^{n}(y_{i1}-\mu_1)^2 + C

with C as some residual constant and

\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22}-\sigma_{12}^2} \begin{bmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{bmatrix}

If the data are missing completely at random (MCAR), the modelling and the deduction of the likelihood follow exactly as in the MAR case above, which is surprising at first glance, but less so if we consider both cases as having an "ignorable likelihood"; indeed MCAR is just a special case of MAR.

To get the maximum likelihood estimates, we need to take the derivatives of the above likelihood function with respect to each parameter and solve the resulting likelihood equations.

"But the likelihood equations, however, do not have an obvious solution. Anderson factors the joint distribution of (yi1, yi2) into the marginal distribution of yi1 and the conditional distribution of yi2 given yi1" (page 135 of Little & Rubin's book), by converting the bivariate normal model into a univariate normal model plus a regression model, as shown by the following:

f(y_{i1}, y_{i2} \mid \mu, \Sigma) = f(y_{i2} \mid y_{i1}, \beta_{20.1}, \beta_{21.1}, \sigma_{22.1}, \mu_1, \sigma_{11}) \times f(y_{i1} \mid \mu_1, \sigma_{11})

f(y_{i1} \mid \mu_1, \sigma_{11}) = N(\mu_1, \sigma_{11})

f(y_{i2} \mid y_{i1}, \beta_{20.1}, \beta_{21.1}, \sigma_{22.1}, \mu_1, \sigma_{11}) = f(y_{i2} \mid y_{i1}, \beta_{20.1}, \beta_{21.1}, \sigma_{22.1}) = N(\beta_{20.1} + \beta_{21.1} y_{i1},\ \sigma_{22.1})

The reason that we are able to do such "converting" or factoring is that the new parameter set

φ = {µ1, σ11, β20.1, β21.1, σ22.1}

has a one-to-one correspondence to the original parameter set

θ = {µ1, µ2, σ11, σ12, σ22}

with µ1 and σ11 in common, and the other components of θ expressed as the following functions of the components of φ:

µ2 = β20.1 + β21.1µ1

σ12 = β21.1σ11

σ22 = σ22.1 + β21.1² σ11


they are considered to be "alternative parametrizations" of the same data model (I doubt such equivalence, although I agree they should be very similar).

After factoring the original likelihood into the likelihoods of two simple models, their MLEs can easily be solved directly, and the asymptotic properties of the estimates can be evaluated by the standard methods for MLEs.
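A minimal sketch of these factored ML estimates in R, assuming the data frame from the Section 1 sketch; the back-transformation follows the three relations above.

```r
ml_ignorable <- function(d) {
  cc   <- d[!d$m, ]                         # complete cases
  mu1  <- mean(d$y1)                        # mu1 and sigma11 use all n cases
  s11  <- mean((d$y1 - mu1)^2)
  fit  <- lm(y2 ~ y1, data = cc)            # regression of y2 on y1, complete cases
  b20  <- unname(coef(fit)[1]); b21 <- unname(coef(fit)[2])
  s221 <- mean(residuals(fit)^2)            # sigma_{22.1}
  c(mu1   = mu1,
    mu2   = b20 + b21 * mu1,                # mu2 = beta_{20.1} + beta_{21.1} mu1
    sig11 = s11,
    sig12 = b21 * s11,                      # sigma12 = beta_{21.1} sigma11
    sig22 = s221 + b21^2 * s11)             # sigma22 = sigma_{22.1} + beta^2 sigma11
}
```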

5.2 Maximum Likelihood Approach for Non-ignorable Likelihood

If it is not reasonable to assume MAR, or we know the data are NMAR, then we need to assume a certain structure of missingness in order to proceed. For this hypothetical problem, suppose we know the trivariate normal distribution structure as originally defined, with all parameters unknown:

(yi1, yi2, ui)T ∼ N(µ∗,Σ∗) for all i ∈ {1, · · · , n}.

\mu^* = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \qquad
\Sigma^* = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}

In this case, the likelihood is not ignorable any more, which means M and φ are not ignorable and we have to include them in the likelihood. Define

θ∗ = {µ1, µ2, µ3, σ11, σ22, σ33, σ12, σ13, σ23}

This θ* contains all possible parameters for the data and the missingness. Thus the log-likelihood function is

L_{Obs} = \log f(Y_1, Y_2, M \mid \theta^*)
        = \log\left( \prod_{i=1}^{r} f(y_{i1}, y_{i2}, m_i = 0) \prod_{i=r+1}^{n} f(y_{i1}, m_i = 1) \right)
        = \log\left( \prod_{i=1}^{r} f(y_{i1}, y_{i2}, u_i \ge 0) \prod_{i=r+1}^{n} f(y_{i1}, u_i < 0) \right)

Solving the likelihood equations for the above log-likelihood potentially involves integration and inverting the covariance matrix, so the solution for the MLEs may not be obvious either.

One possible way is again to convert the trivariate normal model into two sequential regression models plus a univariate normal model, similar to the factorization used for the ignorable case in the book:

f(Y1, Y2, U |θ) = f(U |Y1, Y2, θ)f(Y2|Y1, θ)f(Y1|θ)

But its details have not been developed in this paper.

6 The EM Algorithm

6.1 EM Algorithm for Ignorable Likelihood

Again suppose we know, or it is reasonable to assume, that the data are at least missing at random (MAR), and it is also appropriate to assume that each independent pair yi = (yi1, yi2) has a bivariate normal distribution as follows:

y_i = \begin{bmatrix} y_{i1} \\ y_{i2} \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}\right) \quad \text{for all } i \in \{1, \cdots, n\}.

E Step: find the expectation of the following sufficient statistics given all the observed values YO = (Y1, Y2) and the current parameter estimate θ = {µ1, µ2, σ11, σ12, σ22}:

s_1 = \sum_{i=1}^{n} y_{i1}, \quad s_2 = \sum_{i=1}^{n} y_{i2}, \quad s_{11} = \sum_{i=1}^{n} y_{i1}^2, \quad s_{22} = \sum_{i=1}^{n} y_{i2}^2, \quad s_{12} = \sum_{i=1}^{n} y_{i1} y_{i2}

Given the observed values, s1 and s11 are constants. The other three expectations can be found from the properties of the multivariate normal distribution:

E(s_2 \mid Y_O, \theta) = \sum_{i=1}^{r} y_{i2} + \sum_{i=r+1}^{n} \left[\mu_2 + \sigma_{12}(y_{i1}-\mu_1)/\sigma_{11}\right]

E(s_{22} \mid Y_O, \theta) = \sum_{i=1}^{r} y_{i2}^2 + \sum_{i=r+1}^{n} E(y_{i2}^2 \mid y_{i1}, \theta)
 = \sum_{i=1}^{r} y_{i2}^2 + \sum_{i=r+1}^{n} \left[\mathrm{Var}(y_{i2} \mid y_{i1}, \theta) + \big(E(y_{i2} \mid y_{i1}, \theta)\big)^2\right]
 = \sum_{i=1}^{r} y_{i2}^2 + \sum_{i=r+1}^{n} \left[(\sigma_{22} - \sigma_{12}^2/\sigma_{11}) + \big(\mu_2 + \sigma_{12}(y_{i1}-\mu_1)/\sigma_{11}\big)^2\right]

E(s_{12} \mid Y_O, \theta) = \sum_{i=1}^{r} y_{i1} y_{i2} + \sum_{i=r+1}^{n} E(y_{i1} y_{i2} \mid y_{i1}, \theta)
 = \sum_{i=1}^{r} y_{i1} y_{i2} + \sum_{i=r+1}^{n} y_{i1} E(y_{i2} \mid y_{i1}, \theta)
 = \sum_{i=1}^{r} y_{i1} y_{i2} + \sum_{i=r+1}^{n} y_{i1}\left[\mu_2 + \sigma_{12}(y_{i1}-\mu_1)/\sigma_{11}\right]

M Step: given the filled-in expectations, calculate the usual maximum likelihood estimates, which are also the moment-based estimators of µ and Σ:

\hat\mu_1 = s_1/n, \quad \hat\mu_2 = s_2/n, \quad \hat\sigma_{11} = s_{11}/n - \hat\mu_1^2, \quad \hat\sigma_{22} = s_{22}/n - \hat\mu_2^2, \quad \hat\sigma_{12} = s_{12}/n - \hat\mu_1\hat\mu_2

The newly estimated parameters are taken as the updates and passed into the next round of iteration.

Convergence can be confirmed graphically or numerically (Fig. 2), and most of the time the algorithm converges very fast. The variance of the estimates is assessed by bootstrapping.


Figure 2: Convergence of the EM algorithm at a = 0, b = 2 (trajectories of the parameter values µ2, σ12 and σ22 against the iteration number).
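A minimal sketch of the E and M steps above in R, assuming y1 is fully observed and y2 contains NAs for the missing cases; the starting values and the fixed iteration count are my own choices.

```r
em_bivnorm <- function(y1, y2, n_iter = 100) {
  n   <- length(y1); obs <- !is.na(y2)
  mu1 <- mean(y1); s11 <- mean((y1 - mu1)^2)     # MLEs from the fully observed y1
  mu2 <- mean(y2[obs]); s22 <- var(y2[obs]); s12 <- cov(y1[obs], y2[obs])
  for (t in seq_len(n_iter)) {
    # E step: conditional mean and variance of the missing y2 given y1
    e_y2 <- ifelse(obs, y2, mu2 + s12 * (y1 - mu1) / s11)
    v_y2 <- ifelse(obs, 0, s22 - s12^2 / s11)
    s2   <- sum(e_y2)
    s22e <- sum(v_y2 + e_y2^2)
    s12e <- sum(y1 * e_y2)
    # M step: moment-based updates
    mu2 <- s2 / n
    s22 <- s22e / n - mu2^2
    s12 <- s12e / n - mu1 * mu2
  }
  c(mu1 = mu1, mu2 = mu2, sig11 = s11, sig12 = s12, sig22 = s22)
}
```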

6.2 EM Algorithm for Non-ignorable Likelihood

For the non-ignorable likelihood, which is the NMAR case, we just need to model M along with its parameters, as we did in the maximum likelihood case. But finding the conditional expectations is harder.

(y_{i1}, y_{i2}, u_i)^T \sim N(\mu^*, \Sigma^*) \quad \text{for all } i \in \{1, \cdots, n\},

\mu^* = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \qquad
\Sigma^* = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}, \qquad
\theta^* = \{\mu_1, \mu_2, \mu_3, \sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{13}, \sigma_{23}\}

E Step: find the expectation of the sufficient statistics given all the observed information Yo = (Y1, Y2) and M. In a similar way we have the following sufficient statistics:

{s1, s2, s3, s11, s22, s33, s12, s13, s23}

Given the observed values, s1 and s11 are again constants. The rest of the conditional expectations are

E(s_2 \mid Y_O, M, \theta^*) = \sum_{i=1}^{r} y_{i2} + \sum_{i=r+1}^{n} E(y_{i2} \mid y_{i1}, m_i = 1, \theta^*)

f(y_{i2} \mid y_{i1}, m_i = 1) = \frac{f(y_{i2}, y_{i1}, u_i < 0)}{f(y_{i1}, u_i < 0)} = \frac{\int_{-\infty}^{0} f(y_{i2}, y_{i1}, u_i)\, du_i}{\int_{-\infty}^{0} f(y_{i1}, u_i)\, du_i}
 = \frac{f(y_{i1})\, f(y_{i2} \mid y_{i1}) \int_{-\infty}^{0} f(u_i \mid y_{i2}, y_{i1})\, du_i}{f(y_{i1}) \int_{-\infty}^{0} f(u_i \mid y_{i1})\, du_i}
 = \frac{f(y_{i2} \mid y_{i1}) \int_{-\infty}^{0} f(u_i \mid y_{i2}, y_{i1})\, du_i}{\int_{-\infty}^{0} f(u_i \mid y_{i1})\, du_i}

E(s_3 \mid Y_O, M, \theta^*) = \sum_{i=1}^{r} E(u_i \mid y_{i1}, y_{i2}, u_i \ge 0) + \sum_{i=r+1}^{n} E(u_i \mid y_{i1}, u_i < 0)

f(u_i \mid y_{i1}, u_i < 0) = \frac{f(u_i, y_{i1})}{\int_{-\infty}^{0} f(u_i, y_{i1})\, du_i} = \frac{f(u_i \mid y_{i1})}{\int_{-\infty}^{0} f(u_i \mid y_{i1})\, du_i}, \quad u_i \in (-\infty, 0)

f(u_i \mid y_{i1}, y_{i2}, u_i \ge 0) = \frac{f(u_i, y_{i1}, y_{i2})}{\int_{0}^{\infty} f(u_i, y_{i1}, y_{i2})\, du_i} = \frac{f(u_i \mid y_{i1}, y_{i2})}{\int_{0}^{\infty} f(u_i \mid y_{i1}, y_{i2})\, du_i}, \quad u_i \in [0, \infty)

\cdots\cdots \text{(another five conditional expectations are needed)} \cdots\cdots

The above expectations do not have explicit forms, but they can be calculated numerically. Notice that the denominators of all the above conditional densities are constants.
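As an illustration of the numerical route, the following sketch computes one of these truncated expectations, E(u_i | y_{i1}, u_i < 0), assuming the conditional distribution u_i | y_{i1} ~ N(m, s2) has already been derived from the current θ*; the function name and arguments are mine.

```r
e_u_given_y1_neg <- function(m, s2) {
  s <- sqrt(s2)
  denom <- pnorm(0, mean = m, sd = s)            # P(u < 0 | y1), the constant denominator
  integrate(function(u) u * dnorm(u, m, s) / denom,
            lower = -Inf, upper = 0)$value       # E(u | y1, u < 0)
}
```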

7 The Bayesian Approach

Given a prior and a parametric model for the data, we can always use Bayesian inference to find the posterior, and all subsequent inferences can be based on the posterior distributions.

The problem is that the posterior distributions may not have explicit forms, and we then need MCMC algorithms to draw samples from the posterior distribution. The construction and tuning of the MCMC algorithms may not be trivial.

7.1 Bayesian by Metropolis-Hastings Algorithm for Ignorable Likelihood

In this case we borrow the parametric model assumption and the ignorable likelihood function entirely from 5.1, and use the Metropolis-Hastings algorithm to draw samples from the posterior distribution.

To estimate the parameter set

θ = (µ,Σ) = (µ1, µ2, σ11, σ12, σ22)

we adopt the following improper, non-informative prior, which is the Jeffreys Prior

p(θ) ∝ |Σ|−3/2


The posterior distribution of θ is

p(\theta \mid Y_1, Y_2) = \frac{f(Y_1, Y_2 \mid \theta)\, p(\theta)}{\int f(Y_1, Y_2 \mid \theta)\, p(\theta)\, d\theta} \propto f(Y_1, Y_2 \mid \theta)\, |\Sigma|^{-3/2}

f(Y_1, Y_2 \mid \theta) = \prod_{i=1}^{r} \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(y_i-\mu)^T\Sigma^{-1}(y_i-\mu)\right] \prod_{i=r+1}^{n} \frac{1}{\sqrt{2\pi\sigma_{11}}} \exp\left[-\tfrac{1}{2\sigma_{11}}(y_{i1}-\mu_1)^2\right]

Since θ̂MLE asymptotically has a multivariate normal distribution, and Bayesian inference with a non-informative prior is almost equivalent to the ML approach, I first used an independent multivariate normal proposal for θ.

inferences with non-informative prior is almost equivalent to ML approaches, I first usedan independent multivariate normal proposal for θ.

The parameter estimates and their variance estimates from the EM algorithm is usedto construct the following independent 5-dimensional multivariate normal proposal

p(θ(1)) = N(µp,Σp)

µp = θ Σp = Diag(σµ1 , σµ2 , σσ11 , σσ12 , σσ22)

The problem I encountered of occasionally proposing an invalid covariance matrix, and the eventual poor performance of the multivariate normal proposal even after adjustments, made me realize that a Wishart proposal for the covariance matrix may be more appropriate. Thus the final proposal for (µ, Σ) is as follows:

p(\mu^{(1)}, \Sigma^{(1)}) = p(\mu^{(1)})\, p(\Sigma^{(1)}), \qquad \mu^{(1)} \sim N(\mu_p, \Sigma_{\mu_p}), \qquad \Sigma^{(1)} \sim \mathrm{Wishart}(S/df,\ df)

The above θ^{(1)} is the proposed parameter. Let θ^{(0)} be the current value of the chain; given the independent proposal g(θ^{(1)}), the acceptance probability of θ^{(1)} is

\rho(\theta^{(0)}, \theta^{(1)}) = \min\left\{ \frac{p(\theta^{(1)} \mid Y_1, Y_2)\, g(\theta^{(0)})}{p(\theta^{(0)} \mid Y_1, Y_2)\, g(\theta^{(1)})},\ 1 \right\}

To increase the numerical stability of this calculation, we work on the log scale first and convert back by taking the exponential:

\log\left\{ \frac{p(\theta^{(1)} \mid Y_1, Y_2)\, g(\theta^{(0)})}{p(\theta^{(0)} \mid Y_1, Y_2)\, g(\theta^{(1)})} \right\} = \log p(\theta^{(1)} \mid Y_1, Y_2) + \log g(\theta^{(0)}) - \log p(\theta^{(0)} \mid Y_1, Y_2) - \log g(\theta^{(1)})

\log p(\theta^{(1)} \mid Y_1, Y_2) = \sum_{i=1}^{r} \left[-\log 2\pi - \tfrac{1}{2}\log|\Sigma^{(1)}| - \tfrac{1}{2}(y_i-\mu^{(1)})^T(\Sigma^{(1)})^{-1}(y_i-\mu^{(1)})\right]
 + \sum_{i=r+1}^{n} \left[-\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log \sigma^{(1)}_{11} - \tfrac{1}{2\sigma^{(1)}_{11}}(y_{i1}-\mu^{(1)}_1)^2\right] - \tfrac{3}{2}\log|\Sigma^{(1)}|

\log g(\theta^{(0)}) = -\tfrac{5}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_p| - \tfrac{1}{2}(\theta^{(0)}-\mu_p)^T \Sigma_p^{-1} (\theta^{(0)}-\mu_p)


After further tuning and testing of the parameter settings in the proposal, we finally reached a seemingly optimal and acceptable acceptance rate for the chain. The convergence of the chain and the numerical properties of the posterior distributions are shown in Figure 3.
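A minimal sketch of one Metropolis-Hastings update with the independent proposal described above; the log-posterior function log_post (the ignorable likelihood plus the Jeffreys prior term) and the proposal parameters mu_p, Sig_mu_p, S, df are assumed to have been set up beforehand, and the helper names are mine.

```r
mh_step <- function(mu0, Sig0, log_post, mu_p, Sig_mu_p, S, df) {
  # propose mu and Sigma independently
  mu1  <- MASS::mvrnorm(1, mu_p, Sig_mu_p)
  Sig1 <- stats::rWishart(1, df = df, Sigma = S / df)[, , 1]
  # log density of the independent proposal g
  log_g <- function(mu, Sig)
    mvtnorm::dmvnorm(mu, mu_p, Sig_mu_p, log = TRUE) +
      log(MCMCpack::dwish(Sig, v = df, S = S / df))
  # log acceptance ratio, computed on the log scale for numerical stability
  log_rho <- log_post(mu1, Sig1) + log_g(mu0, Sig0) -
             log_post(mu0, Sig0) - log_g(mu1, Sig1)
  if (log(runif(1)) < log_rho) list(mu = mu1, Sig = Sig1, accept = TRUE)
  else                         list(mu = mu0, Sig = Sig0, accept = FALSE)
}
```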

Figure 3: The convergence of the Metropolis-Hastings algorithm at MAR (trace plots and posterior densities for µ1, µ2, σ11, σ12 and σ22; acceptance rate 0.181; posterior means 1.06, 5.13, 0.91, 1.77 and 4.49 respectively).

7.2 Bayesian by Data Augmentation for Ignorable Likelihood

Another way to draw samples from the posterior distribution of the parameters is through data augmentation (DA). Given the current parameter draw θ^{(t)} at step t, the iteration steps for data augmentation are defined as

I Step: draw Y_m^{(t+1)} from the density p(Y_m \mid Y_o, \theta^{(t)});

P Step: draw \theta^{(t+1)} from the density p(\theta \mid Y_o, Y_m^{(t+1)}).

For this particular example, the I step would be

y_{m,i2}^{(t+1)} \sim p(y_{m,i2} \mid y_{o,i1}, \theta^{(t)}) = N\!\left(\mu_2^{(t)} + \sigma_{12}^{(t)}(y_{i1}-\mu_1^{(t)})/\sigma_{11}^{(t)},\ \ \sigma_{22}^{(t)} - (\sigma_{12}^{(t)})^2/\sigma_{11}^{(t)}\right)

The P step would be

S = \sum_{i=1}^{n} (y_i - \bar y)(y_i - \bar y)^T = (n-1)\hat\Sigma

(\Sigma^{(t+1)} \mid Y^{(t+1)}) \sim \text{Inverse-Wishart}(S^{(t+1)},\ n-1)

(\mu^{(t+1)} \mid \Sigma^{(t+1)}, Y^{(t+1)}) \sim N_k(\bar Y^{(t+1)},\ \Sigma^{(t+1)}/n)

The random parameter draws from the inverse-Wishart distribution are made possible by the R package MCMCpack.

The performance of the data augmentation and the posterior distributions of the parameters are shown in Figure 4.
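A minimal sketch of this data augmentation sampler in R, assuming y1 is fully observed and y2 contains NAs; it uses MCMCpack::riwish for the inverse-Wishart draw mentioned above, and the function name and starting values are mine.

```r
da_sampler <- function(y1, y2, n_iter = 1000) {
  n <- length(y1); miss <- is.na(y2)
  mu  <- c(mean(y1), mean(y2, na.rm = TRUE))       # crude starting values
  Sig <- diag(2)
  draws <- matrix(NA, n_iter, 5,
                  dimnames = list(NULL, c("mu1", "mu2", "sig11", "sig12", "sig22")))
  for (t in seq_len(n_iter)) {
    # I step: draw the missing y2 from their conditional normal given y1
    cm <- mu[2] + Sig[1, 2] * (y1[miss] - mu[1]) / Sig[1, 1]
    cv <- Sig[2, 2] - Sig[1, 2]^2 / Sig[1, 1]
    y2[miss] <- rnorm(sum(miss), cm, sqrt(cv))
    # P step: draw (Sigma, mu) from their posterior given the completed data
    Y    <- cbind(y1, y2)
    ybar <- colMeans(Y)
    S    <- crossprod(sweep(Y, 2, ybar))           # sum-of-squares matrix
    Sig  <- MCMCpack::riwish(n - 1, S)
    mu   <- MASS::mvrnorm(1, ybar, Sig / n)
    draws[t, ] <- c(mu, Sig[1, 1], Sig[1, 2], Sig[2, 2])
  }
  draws
}
```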

7.3 Bayesian Approach for Non-ignorable Likelihood

In the same way, we borrow the parametric model assumption and the likelihood function from 5.2, and may use the Metropolis-Hastings algorithm to draw samples from the posterior. The new parameter set is

\theta^* = (\mu^*, \Sigma^*) = \{\mu_1, \mu_2, \mu_3, \sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{13}, \sigma_{23}\}

The Jeffreys prior is p(\theta^*) \propto |\Sigma^*|^{-2}.

The posterior distribution is

p(\theta^* \mid Y_1, Y_2, M) \propto \left[\prod_{i=1}^{r} f(y_{i1}, y_{i2}, u_i \ge 0) \prod_{i=r+1}^{n} f(y_{i1}, u_i < 0)\right] \times |\Sigma^*|^{-2}
 = \left[\prod_{i=1}^{r} f(u_i \ge 0 \mid y_{i1}, y_{i2})\, f(y_{i1}, y_{i2}) \prod_{i=r+1}^{n} f(u_i < 0 \mid y_{i1})\, f(y_{i1})\right] \times |\Sigma^*|^{-2}

Although theoretically it might be doable, considering the problems encountered and the relatively poor final convergence in the previous case, we did not pursue it further.


Figure 4: The performance of the data augmentation algorithm (trace plots and posterior densities for µ1, µ2, σ11, σ12 and σ22; posterior means 1.11, 5.88, 0.84, 1.08 and 2.17 respectively).


8 Multiple Imputation for Ignorable Likelihood

8.1 Multiple Imputation in this Problem

For multiple imputation, once the multiply imputed data sets have been created, the parameter estimates and their variances are calculated by

\bar\theta = \frac{\sum_{m=1}^{M} \hat\theta_m}{M}, \qquad
\mathrm{var}(\bar\theta) = \frac{\sum_{m=1}^{M} \hat\sigma^2_m}{M} + \frac{M+1}{M} \cdot \frac{\sum_{m=1}^{M} (\hat\theta_m - \bar\theta)^2}{M-1}

Here θ̂m is the estimate of θ from the m-th imputed data set, θ̄ is the average of the estimates across the M imputed data sets, and σ̂²m is the variance estimate of θ̂m in the m-th data set.
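A minimal sketch of these combining rules for a single scalar parameter; est holds the M point estimates and var_est the M within-imputation variance estimates (the names are mine).

```r
pool_mi <- function(est, var_est) {
  M    <- length(est)
  qbar <- mean(est)          # combined point estimate
  w    <- mean(var_est)      # within-imputation variance
  b    <- var(est)           # between-imputation variance, sum((est - qbar)^2) / (M - 1)
  list(estimate = qbar, variance = w + (M + 1) / M * b)
}
```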

To estimate the variance of the variance and covariance estimators within one data set, I used the following results:

\hat\sigma_{11} = s_1^2 = \frac{1}{n}\sum_{i=1}^{n} (y_{i1} - \bar y_1)^2, \qquad
\mathrm{Var}(\hat\sigma_{11}) = \mathrm{Var}(s_1^2) = \frac{2\sigma_{11}^2}{n-1} \approx \frac{2 s_1^4}{n-1}

\widehat{\mathrm{Cov}}(Y) = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar y)(y_i - \bar y)^T, \ \text{with off-diagonal entry } \hat\sigma_{12}, \qquad
\mathrm{Var}(\hat\sigma_{12}) = \frac{1}{n-1}\sum_{i=1}^{n}\left[\sigma_{12}(\sigma_{12} + 2\mu_1\mu_2) + \sigma_{11}\sigma_{22}\right] \qquad (\dagger)

8.2 A Semi Multiple Imputation Scheme

We can perform the conditional draw imputation M = 10 times and treat the resulting 10 data sets as a multiple imputation set.

The problem with this method is that, by definition, the parameter draws for multiple imputation should be made from the posterior distribution of the parameters given all the observed data, while in the conditional draw imputation the posterior distribution of the parameters is based only on the complete cases.

While I call this method a semi multiple imputation scheme, in Rubin's book Multiple Imputation for Nonresponse in Surveys (page 165) it is deemed a "true" MI method, probably because θY|X and θX are claimed to be a priori independent and thus a posteriori independent too. This needs further study. The regression-based multiple imputation in SAS also adopts this method.

Thus I call this a semi multiple imputation scheme. The advantage of this scheme is that the posteriors based on the complete cases are easier to derive and mostly have explicit forms, so it is much easier to implement. As long as the rate of missingness is not high, the performance of this method is very close to that of the "true" multiple imputation method.
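A minimal sketch of this scheme, reusing the hypothetical impute_cond_draw and pool_mi functions from the earlier sketches and pooling the estimate of µ2 as an example.

```r
semi_mi_mu2 <- function(d, M = 10) {
  est <- var_est <- numeric(M)
  for (m in seq_len(M)) {
    di <- impute_cond_draw(d)            # one independent conditional draw imputation
    est[m]     <- mean(di$y2)            # estimate of mu2 from the completed data
    var_est[m] <- var(di$y2) / nrow(di)  # its within-imputation variance
  }
  pool_mi(est, var_est)
}
```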

8.3 Multiple Imputation by Data Augmentation

In the Bayesian approach by data augmentation, we take several parameter draws from the posterior after the chain reaches stationarity. We can then generate M = 10 imputed data sets and carry out the analysis as stated before.

9 Simulation Results and Discussions

Ideally, running many simulations and averaging the results across them (such as averaging 1000 simulation results) would provide more convincing results, and we could also study other aspects of the different methods this way, such as the coverage probabilities of the confidence intervals they produce. But due to the relatively long execution time for each simulation, and the fact that the outputs from different simulations behave fairly consistently, here we only provide the output of a single simulation for the different methods under the different missing-data mechanisms.

The final outputs are shown in Figure 5. As stated, only one simulation is shown for each missing-data mechanism, and the output from the mi package is also included as a benchmark for comparison. (Krista suggested that it is better to use plots rather than tables of values to show the results, since they are easier to read and comprehend. This is a good idea, and I need to think about how to represent all the information by plots; coloured square matrices would be a starting point.)

Figure 5: The output of all methods under different missing-data mechanisms. For the first five methods, the variance estimates of the parameter estimates are obtained by bootstrapping; for the two Bayesian methods, by computing the variance of the samples from the posterior distribution; and for the multiple imputation methods, by the formulas stated above.

We can summarize the output in the following points and remarks:

1. Generally, all methods perform as expected. Complete-case analysis always underestimates the variance of the estimates, because it ignores part of the observed information, and it causes serious bias in the NMAR case. Conditional mean and conditional draw imputation recover part of the bias under the NMAR case, but they do funny things in estimating the variance (discussed below). The EM algorithm, the Bayesian DA approach and all the multiple imputation approaches behave remarkably similarly, because essentially they are based on the same likelihood. The exception is the Bayesian approach by the Metropolis-Hastings algorithm, which performs relatively badly in all the estimates, probably due to its poor convergence.

2. The bootstrap estimates of the variance of the conditional mean and conditional draw methods for µ2 (the sd_mu2 column) are almost always larger than those of the other methods. This is a little unexpected at first glance, since we learned that these methods should "underestimate" the variance. But when we say "underestimate", we mean estimating the variance of µ̂2 from the imputed data set rather than by bootstrapping; the bootstrap reflects more of the uncertainty in fitting the regression from the partially observed data. Thus the final estimate of µ2 fluctuates more than it does under the other methods.

3. The variance estimate of the EM algorithm for σ22 (column sd_sig22) is always larger than that of all the other methods. This is suspicious because it shouldn't be. One reason could be that the convergence for σ22 is not as good as for the others, and I used nem = 100 iterations as a universal cut-off point for all estimators. This should be further confirmed.

4. The variance estimates for σ12 (column sd_sig12) split into two groups: the EM algorithm and the Bayesian DA approach always give similar results, and all the multiple imputation methods give another. Here I trust the results from the first group, since for this variance estimate of the covariance I used an unconfirmed formula in the multiple imputation section (the (†) one). Further study is needed to correct this, regarding what the best way to estimate it should be.

5. If the true model is NMAR and we analyze it as if it were MAR, all the methods we used produce significant bias, but complete-case analysis produces the biggest bias of all. All the other methods adjust this bias to some extent by recovering information from the correlated variable Y1.

Generally, for such a hypothetical problem, considering ease of implementation and final performance, I would conclude that Bayesian analysis by data augmentation and multiple imputation by data augmentation are the best methods of all. Since multiple imputation has other attractive characteristics, such as being computationally economical, it may be even better for similar problems in real life. The semi multiple imputation method is not a bad choice as long as the rate of missingness is not very high. The EM algorithm is also dependable, but as the model grows larger, its implementation might become more and more tedious.

But given a different problem, implementing the data augmentation algorithm may also be hard, if there is no obvious form for the posterior distribution of either the parameters or the missing values.

And real-life problems are never as clean and ideal as such a model-based simulated problem. Yalin's project reminds me that even in a real-life regression problem as simple as that, all these model-based imputation methods may produce misleading or hard-to-explain results. I think the major question is whether we can indeed impute values that best mimic the observed data. In Yalin's problem, if the data do not even follow a strict multivariate or regression model, then imputing the missing values by a multivariate or regression model will certainly introduce bias. Unless our data are extremely well behaved, I do think some more robust ways of imputation would make more sense.

10 Remaining Questions

10.1 Robust Imputation Ways

Check Rubin’s second edition first. You missed that chapter.

10.2 Non-parametric Multiple Imputation by Propensity Score

References: Rubin 1987, pp. 124, 158; Lavori, Dawson, and Shera 1995.

For continuous variables in data sets with arbitrary missing patterns, one can use the Markov chain Monte Carlo (MCMC) method (Schafer 1997) to impute either all the missing values or just enough missing values to make the imputed data sets have monotone missing patterns.


What about a mixture of categorical and continuous variables? Are there any previous studies of this sort?

11 The Methods Adopted by MI Software

11.1 The PROC MI in SAS

The regression method for monotone missing data is indeed the semi multiple imputation scheme mentioned above, which uses conditional draw imputation. Based on the SAS documentation, which refers to Rubin's book Multiple Imputation for Nonresponse in Surveys, this is indeed an MI method, not a semi-MI one. This needs to be further confirmed.

The predictive mean matching method ensures that imputed values are plausible and might be more appropriate than the regression method if the normality assumption is violated (Horton and Lipsitz 2001, p. 246).

Further questions:

1. Can it deal with a combination of categorical and continuous variables in a single data set? It looks like it cannot.

2. How does it carry out the MCMC for monotone and non-monotone missing-pattern data? Is it Metropolis-Hastings-like or data-augmentation-like?

   According to the documentation, it is data augmentation.

   Then how does it deal with the "variance of the covariance"? Does it omit it?

3. How is the propensity score MI carried out by SAS?

   Check the documentation and Rubin's original book.

"It is effective for inferences about the distributions of individual imputed variables, such as a univariate analysis, but it is not appropriate for analyses involving relationships among variables, such as a regression analysis (Schafer 1999, p. 11). It can also produce badly biased estimates of regression coefficients when data on predictor variables are missing (Allison 2000)."

What is the “Bayesian bootstrap” method involved in this method?

11.2 The MI package in R

In the examples provided in the documentation of mi, it looks like it can deal with a combination of different kinds of variable types. This is good.

12 References

1. Statistical Analysis with Missing Data, Second Edition. Roderick J. A. Little and Donald B. Rubin.