1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting

1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008

Multiple Imputation for households surveys

A comparison of methods

Stata Users Group Meeting

Rodrigo Alfaro – Marcelo Fuenzalida


Outline

Multiple Imputation (MI) Methods for MI Chilean Household Surveys Application of MI to EFH Comments and Conclusions Appendix


Multiple Imputation

In empirical applications researchers must work with incomplete data sets.

A “solution” to the problem above is known as Multiple Imputation (MI) procedure.

MI relies on the assumption Missing At Random: “…the probability of missing data on Y is unrelated to

the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)

In our empirical application we assume MAR. Validity of assumption is beyond the scope of this

analysis.


Multiple Imputation

MI is based on the assumption that we have a good proxies of distributions of missing observations in the sample. Given this: We could “fill the blanks” taking random realizations. We create m-versions of the complete datasets in order

to reflect randomness of the procedure.

So, we change our incomplete data set for a set of complete ones. We do not “solve” missing data problem we just measure it.

Analysis of multiple data sets could be combine using Rubin’s rules (Rubin, 1987).


Methods for MI

We divide methods into 2 categories Conditional: Hot-Deck and Univariate. Multivariate: Normal and Chained Equations.

Hot-Deck (hotdeck.ado) Replace missing observations with an observed one taken

randomly within a specific group: males with college. Informal conversations: std. dev.’s are still small.

Univariate (uvis.ado) Regress variable with missing observations on exogenous

variables with no missing. Draw posterior of estimators (beta & sigma) “Predict” missing values.


Methods for MI

Chained Equation (ice.ado) Based on Univariate method, but with the possibility of

having missing values in exogenous variables. Using reverse equation missing values of exogenous

variables are replaced. Loop over previous steps.

Normal Assume Multivariate Normal. Estimate parameters using

EM algorithm (or other initial value), and draw imputations using Data Augmentation procedure.

Theory relies on the convergence of EM. No implemented in Stata. Schafer’s stand-alone

package.


Methods for MI

Schafer’s stand-alone package: Data.


Methods for MI

Schafer’s stand-alone package: EM.


Methods for MI

Schafer’s stand-alone package: DA.


Methods for MI

Schafer’s stand-alone package: DA.


Chilean Households Surveys

We have households surveys with a few number of waves: CASEN, and EPS. CASEN was created to measure poverty. EPS was created to evaluate pension system.

At the Central Bank we have been using these surveys to analyze financial fragility of households.

However, CASEN and EPS were not created for this purpose. We need new sample designs.

In 2007, we started a new survey designed for our purposes: EFH.


Chilean Households Surveys

Our surveys have different levels of information Personal information of each member of the household.

For example: age, year of education, labor income, etc. Aggregate information of the household. For example:

value of assets (cars, house, financial instruments, etc.), debts (mortgage, consumer loans, educational loans, etc.)

Our variables of interest could be irrelevant for some households in the sample. Many households have loans with retails-companies

instead of borrowing the money directly from banks. Few households have personal savings invested in

financial instruments such as stocks, bonds, etc.


Application of MI to EFH

Using conditional methods, we could attach the constraints to the imputation procedure. We are able to impute labor income for each member of

the household, considering only individual level vars. At the household level, we could impute “banks loans” in

a sub-sample of households that declared to have that kind of debt. We use as exogenous variables age, years of education, and gender of interviewee.

We impute “debt in retails-companies” with a different sub-sample but with the same exogenous variables.

But, we cannot impute “debt in retails-companies” with “banks loans” because sub-samples may be different.


Application of MI to EFH

Multivariate methods imply groups of households. Suppose that a household without a house, we could

pretend that the value of its house is “zero”. However, that will be affect the correlation between value of the house and total amount of debts.

Our first round includes 3 groups defined by the credit access. First group includes households without debts in

financial institutions and without any kind of assets. Second group includes households with real assets: cars,

and primary house. Third group adds households with debts in financial

institutions.


Missing Information

Source: EFH 2007.

Results for EFH

Missing values Total observations % of missingAssets Primary house 80 2637 3% 2557

Vehicules 158 2003 8% 1845

Debt Mortgage 137 769 18% 632Debt in retail companies 320 1901 17% 1581Bank loan 47 670 7% 623

4021

Number of observations with

data

Total Number of Households in EFH 2007

Low missing rate of information.

However, combining variables reduces sample size.

We will concentrate our analysis in the second group.

We use logit transformation to avoid unbounded results.


Source: EFH 2007.

Group 2: Conditional

N mean sd mean sdPrimary House 124 41,300,000 51,400,000 41,300,000 51,400,000

HD_sex 42,021,053 53,054,393 41,363,910 51,875,497HD_seq 40,804,511 50,395,197 41,843,008 53,564,263

uvis_sex 45,799,319 61,270,298 45,192,579 60,678,326uvis_seq 45,343,722 61,377,351 45,508,348 61,755,670

Vehicules 121 5,430,165 14,000,000 5,430,165 14,000,000HD_sex 5,515,865 13,545,393 5,606,466 14,084,994HD_seq 6,398,767 21,334,829 5,985,748 17,893,250

uvis_sex 5,732,371 13,955,271 5,678,109 13,815,379uvis_seq 5,861,634 16,442,425 6,043,486 15,476,726

Mortgage 109 14,500,000 10,800,000 14,500,000 10,800,000HD_sex 15,683,299 13,673,049 15,580,981 13,076,041HD_seq 15,753,395 12,699,439 15,778,602 13,257,939

uvis_sex 16,540,712 14,795,948 16,521,443 15,674,534uvis_seq 16,703,439 16,311,423 16,876,838 16,434,770

Debt in retail companies 118 564,897 1,076,239 564,897 1,076,239HD_sex 540,023 1,025,211 549,424 1,056,475HD_seq 543,912 1,029,810 548,536 1,032,109

uvis_sex 542,185 1,037,191 568,213 1,067,170uvis_seq 565,666 1,033,445 563,049 1,034,414

133

133

m=5 m = 10

133

133


Households with debt in retails companies. UVIS and HD could have lower variances than raw data. We note that ICE and NORM are consistent in sd’s.

Source: EFH.

Group 2: Multivariate

N mean sd mean sdPrimary House 124 41,300,000 51,400,000 41,300,000 51,400,000Norm 45,666,537 63,217,837 45,696,699 63,810,076Ice 42,870,280 57,188,693 42,981,100 57,419,325

Vehicules 121 5,430,165 14,000,000 5,430,165 14,000,000Norm 6,498,768 17,089,963 6,446,325 16,600,325Ice 7,282,372 19,242,160 6,558,340 17,211,530

Mortgage 109 14,500,000 10,800,000 14,500,000 10,800,000Norm 15,734,073 13,055,818 15,686,541 12,946,747Ice 15,181,216 12,585,790 15,431,263 12,719,416

Debt in retail companies 118 564,897 1,076,239 564,897 1,076,239Norm 635,011 1,242,141 654,637 1,285,391Ice 707,768 1,427,360 672,833 1,357,371

133

133

133

133

m=5 m = 10


Comments and Conclusions

In our empirical application Hot-Deck as well as Univariate imputation have smaller variances than multivariate methods.

Under a multivariate imputation we are able to have a reasonable standard deviation that reflects the uncertainty of complete data sets.

Moving to multivariate methods we have ICE and NORM. Both have advantages and disadvantages.

ICE is implemented in Stata with many features available to accommodate several models.

In that respect is more general than NORM.



NORM relies in EM algorithm in theoretical terms and DA for imputation method. For that we observed that we need convergence and “reasonable” positive definite matrix.

In practical terms we observed that a high rate of missing data is associated with non-converge of EM algorithm and/or some problems with DA.

In the case of ICE we observed that the algorithm does not have “convergence problem”. We were able to impute data with a high rate of incomplete information.



We think that NORM provides a useful information about the stability of the model.

For that its implementation in Stata would be a good complement for ICE. Don’t you think?

We found 2 versions of NORM code: miss.sas by Paul Allison and norm.R by Alvaro Novo.

We translated miss SAS routine into Stata-ado programming. Allison used SAS package IML (Interactive Matrix Language). We observed that IML is similar to Mata.



However, original code was not “optimized” as Schafer suggested in his book.

A month ago we move to R-code. We found that original code in Fortran was included in R routine.

So, norm.R allows to use Fortran code directly in R.

Speedy up for this meeting, we translated 800 lines of Fortran into Mata in a week.

However, our translation from R to Mata is not good enough… yet.


Problem

Besides technical issues on MI we have an unsolved topic for which your opinion is crucial: Aggregation We want to work at the household level, then aggregation

of individual information must be done somehow. In order to deal with missing observation at individual level

we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.

Alternative, we could code “missing” to the household if any member has missing observation. Because we lost information we discarded it.

Any feasible mixture? Is it possible to add variables in order to account for

incomplete information at individual level?


Multiple Imputation for households surveys

A comparison of methods

Stata Users Group Meeting

Rodrigo Alfaro – Marcelo Fuenzalida


Source: EFH 2007.

All Households: Conditional

N mean sd mean sdPrimary house 2,557 46,467,118 63,446,185 46,467,118 63,446,185

HD_sex 46,742,044 64,178,915 46,634,739 64,084,809HD_seq 46,519,877 64,309,665 46,690,809 64,581,258

uvis_sex 51,023,261 72,903,540 51,095,700 72,962,069uvis_seq 50,809,129 72,860,751 50,543,847 72,175,917

Vehicules 1,845 11,839,545 46,851,535 11,839,545 46,851,535HD_sex 12,020,063 47,783,641 11,926,336 47,152,707HD_seq 11,964,703 47,559,264 11,870,948 47,067,930

uvis_sex 11,636,439 45,164,884 11,689,710 45,217,017uvis_seq 11,904,938 45,975,034 11,823,094 45,687,387

N mean sd mean sdMortgage 632 21,299,214 19,204,692 21,299,214 19,204,692

HD_sex 21,201,238 19,271,787 21,143,672 19,104,947HD_seq 21,244,300 19,184,499 21,202,081 19,246,894

uvis_sex 21,832,867 20,593,237 21,852,518 20,790,692uvis_seq 21,799,370 20,704,561 21,926,977 20,889,226

769

m=5 m=10

2,637

m=5 m=10Assets

2,003

Debt


Source: EFH 2007.

All Households: Conditional

N mean sd mean sdDebt in retail companies 1,581 431,153 622,756 431,153 622,756

HD_sex 426,891 607,481 427,533 615,761HD_seq 422,734 600,901 426,400 614,459

uvis_sex 445,489 658,403 447,371 664,727uvis_seq 444,923 676,724 446,381 683,083

Bank Loan 623 6,175,111 9,850,386 6,175,111 9,850,386HD_sex 6,132,041 9,726,635 6,118,279 9,700,285HD_seq 6,108,118 9,742,051 6,163,441 9,800,157

uvis_sex 6,294,591 10,090,469 6,314,019 10,149,870uvis_seq 6,350,864 10,055,060 6,311,301 10,046,143

m=5 m=10

1,901

670

Debt (continue)

Documents

1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008 Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting