Upload
shon-mcbride
View
214
Download
0
Embed Size (px)
Citation preview
1 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Multiple Imputation for households surveys
A comparison of methods
Stata Users Group Meeting
Rodrigo Alfaro – Marcelo Fuenzalida
2 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Outline
Multiple Imputation (MI) Methods for MI Chilean Household Surveys Application of MI to EFH Comments and Conclusions Appendix
3 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Multiple Imputation
In empirical applications researchers must work with incomplete data sets.
A “solution” to the problem above is known as Multiple Imputation (MI) procedure.
MI relies on the assumption Missing At Random: “…the probability of missing data on Y is unrelated to
the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)
In our empirical application we assume MAR. Validity of assumption is beyond the scope of this
analysis.
4 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Multiple Imputation
MI is based on the assumption that we have a good proxies of distributions of missing observations in the sample. Given this: We could “fill the blanks” taking random realizations. We create m-versions of the complete datasets in order
to reflect randomness of the procedure.
So, we change our incomplete data set for a set of complete ones. We do not “solve” missing data problem we just measure it.
Analysis of multiple data sets could be combine using Rubin’s rules (Rubin, 1987).
5 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
We divide methods into 2 categories Conditional: Hot-Deck and Univariate. Multivariate: Normal and Chained Equations.
Hot-Deck (hotdeck.ado) Replace missing observations with an observed one taken
randomly within a specific group: males with college. Informal conversations: std. dev.’s are still small.
Univariate (uvis.ado) Regress variable with missing observations on exogenous
variables with no missing. Draw posterior of estimators (beta & sigma) “Predict” missing values.
6 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
Chained Equation (ice.ado) Based on Univariate method, but with the possibility of
having missing values in exogenous variables. Using reverse equation missing values of exogenous
variables are replaced. Loop over previous steps.
Normal Assume Multivariate Normal. Estimate parameters using
EM algorithm (or other initial value), and draw imputations using Data Augmentation procedure.
Theory relies on the convergence of EM. No implemented in Stata. Schafer’s stand-alone
package.
7 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
Schafer’s stand-alone package: Data.
8 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
Schafer’s stand-alone package: EM.
9 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
Schafer’s stand-alone package: DA.
10 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Methods for MI
Schafer’s stand-alone package: DA.
11 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Chilean Households Surveys
We have households surveys with a few number of waves: CASEN, and EPS. CASEN was created to measure poverty. EPS was created to evaluate pension system.
At the Central Bank we have been using these surveys to analyze financial fragility of households.
However, CASEN and EPS were not created for this purpose. We need new sample designs.
In 2007, we started a new survey designed for our purposes: EFH.
12 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Chilean Households Surveys
Our surveys have different levels of information Personal information of each member of the household.
For example: age, year of education, labor income, etc. Aggregate information of the household. For example:
value of assets (cars, house, financial instruments, etc.), debts (mortgage, consumer loans, educational loans, etc.)
Our variables of interest could be irrelevant for some households in the sample. Many households have loans with retails-companies
instead of borrowing the money directly from banks. Few households have personal savings invested in
financial instruments such as stocks, bonds, etc.
13 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Application of MI to EFH
Using conditional methods, we could attach the constraints to the imputation procedure. We are able to impute labor income for each member of
the household, considering only individual level vars. At the household level, we could impute “banks loans” in
a sub-sample of households that declared to have that kind of debt. We use as exogenous variables age, years of education, and gender of interviewee.
We impute “debt in retails-companies” with a different sub-sample but with the same exogenous variables.
But, we cannot impute “debt in retails-companies” with “banks loans” because sub-samples may be different.
14 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Application of MI to EFH
Multivariate methods imply groups of households. Suppose that a household without a house, we could
pretend that the value of its house is “zero”. However, that will be affect the correlation between value of the house and total amount of debts.
Our first round includes 3 groups defined by the credit access. First group includes households without debts in
financial institutions and without any kind of assets. Second group includes households with real assets: cars,
and primary house. Third group adds households with debts in financial
institutions.
15 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Missing Information
Source: EFH 2007.
Results for EFH
Missing values Total observations % of missingAssets Primary house 80 2637 3% 2557
Vehicules 158 2003 8% 1845
Debt Mortgage 137 769 18% 632Debt in retail companies 320 1901 17% 1581Bank loan 47 670 7% 623
4021
Number of observations with
data
Total Number of Households in EFH 2007
Low missing rate of information.
However, combining variables reduces sample size.
We will concentrate our analysis in the second group.
We use logit transformation to avoid unbounded results.
16 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Source: EFH 2007.
Group 2: Conditional
N mean sd mean sdPrimary House 124 41,300,000 51,400,000 41,300,000 51,400,000
HD_sex 42,021,053 53,054,393 41,363,910 51,875,497HD_seq 40,804,511 50,395,197 41,843,008 53,564,263
uvis_sex 45,799,319 61,270,298 45,192,579 60,678,326uvis_seq 45,343,722 61,377,351 45,508,348 61,755,670
Vehicules 121 5,430,165 14,000,000 5,430,165 14,000,000HD_sex 5,515,865 13,545,393 5,606,466 14,084,994HD_seq 6,398,767 21,334,829 5,985,748 17,893,250
uvis_sex 5,732,371 13,955,271 5,678,109 13,815,379uvis_seq 5,861,634 16,442,425 6,043,486 15,476,726
Mortgage 109 14,500,000 10,800,000 14,500,000 10,800,000HD_sex 15,683,299 13,673,049 15,580,981 13,076,041HD_seq 15,753,395 12,699,439 15,778,602 13,257,939
uvis_sex 16,540,712 14,795,948 16,521,443 15,674,534uvis_seq 16,703,439 16,311,423 16,876,838 16,434,770
Debt in retail companies 118 564,897 1,076,239 564,897 1,076,239HD_sex 540,023 1,025,211 549,424 1,056,475HD_seq 543,912 1,029,810 548,536 1,032,109
uvis_sex 542,185 1,037,191 568,213 1,067,170uvis_seq 565,666 1,033,445 563,049 1,034,414
133
133
m=5 m = 10
133
133
17 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Households with debt in retails companies. UVIS and HD could have lower variances than raw data. We note that ICE and NORM are consistent in sd’s.
Source: EFH.
Group 2: Multivariate
N mean sd mean sdPrimary House 124 41,300,000 51,400,000 41,300,000 51,400,000Norm 45,666,537 63,217,837 45,696,699 63,810,076Ice 42,870,280 57,188,693 42,981,100 57,419,325
Vehicules 121 5,430,165 14,000,000 5,430,165 14,000,000Norm 6,498,768 17,089,963 6,446,325 16,600,325Ice 7,282,372 19,242,160 6,558,340 17,211,530
Mortgage 109 14,500,000 10,800,000 14,500,000 10,800,000Norm 15,734,073 13,055,818 15,686,541 12,946,747Ice 15,181,216 12,585,790 15,431,263 12,719,416
Debt in retail companies 118 564,897 1,076,239 564,897 1,076,239Norm 635,011 1,242,141 654,637 1,285,391Ice 707,768 1,427,360 672,833 1,357,371
133
133
133
133
m=5 m = 10
18 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Comments and Conclusions
In our empirical application Hot-Deck as well as Univariate imputation have smaller variances than multivariate methods.
Under a multivariate imputation we are able to have a reasonable standard deviation that reflects the uncertainty of complete data sets.
Moving to multivariate methods we have ICE and NORM. Both have advantages and disadvantages.
ICE is implemented in Stata with many features available to accommodate several models.
In that respect is more general than NORM.
19 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Comments and Conclusions
NORM relies in EM algorithm in theoretical terms and DA for imputation method. For that we observed that we need convergence and “reasonable” positive definite matrix.
In practical terms we observed that a high rate of missing data is associated with non-converge of EM algorithm and/or some problems with DA.
In the case of ICE we observed that the algorithm does not have “convergence problem”. We were able to impute data with a high rate of incomplete information.
20 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Comments and Conclusions
We think that NORM provides a useful information about the stability of the model.
For that its implementation in Stata would be a good complement for ICE. Don’t you think?
We found 2 versions of NORM code: miss.sas by Paul Allison and norm.R by Alvaro Novo.
We translated miss SAS routine into Stata-ado programming. Allison used SAS package IML (Interactive Matrix Language). We observed that IML is similar to Mata.
21 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Comments and Conclusions
However, original code was not “optimized” as Schafer suggested in his book.
A month ago we move to R-code. We found that original code in Fortran was included in R routine.
So, norm.R allows to use Fortran code directly in R.
Speedy up for this meeting, we translated 800 lines of Fortran into Mata in a week.
However, our translation from R to Mata is not good enough… yet.
22 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Problem
Besides technical issues on MI we have an unsolved topic for which your opinion is crucial: Aggregation We want to work at the household level, then aggregation
of individual information must be done somehow. In order to deal with missing observation at individual level
we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.
Alternative, we could code “missing” to the household if any member has missing observation. Because we lost information we discarded it.
Any feasible mixture? Is it possible to add variables in order to account for
incomplete information at individual level?
23 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Multiple Imputation for households surveys
A comparison of methods
Stata Users Group Meeting
Rodrigo Alfaro – Marcelo Fuenzalida
24 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Source: EFH 2007.
All Households: Conditional
N mean sd mean sdPrimary house 2,557 46,467,118 63,446,185 46,467,118 63,446,185
HD_sex 46,742,044 64,178,915 46,634,739 64,084,809HD_seq 46,519,877 64,309,665 46,690,809 64,581,258
uvis_sex 51,023,261 72,903,540 51,095,700 72,962,069uvis_seq 50,809,129 72,860,751 50,543,847 72,175,917
Vehicules 1,845 11,839,545 46,851,535 11,839,545 46,851,535HD_sex 12,020,063 47,783,641 11,926,336 47,152,707HD_seq 11,964,703 47,559,264 11,870,948 47,067,930
uvis_sex 11,636,439 45,164,884 11,689,710 45,217,017uvis_seq 11,904,938 45,975,034 11,823,094 45,687,387
N mean sd mean sdMortgage 632 21,299,214 19,204,692 21,299,214 19,204,692
HD_sex 21,201,238 19,271,787 21,143,672 19,104,947HD_seq 21,244,300 19,184,499 21,202,081 19,246,894
uvis_sex 21,832,867 20,593,237 21,852,518 20,790,692uvis_seq 21,799,370 20,704,561 21,926,977 20,889,226
769
m=5 m=10
2,637
m=5 m=10Assets
2,003
Debt
25 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Source: EFH 2007.
All Households: Conditional
N mean sd mean sdDebt in retail companies 1,581 431,153 622,756 431,153 622,756
HD_sex 426,891 607,481 427,533 615,761HD_seq 422,734 600,901 426,400 614,459
uvis_sex 445,489 658,403 447,371 664,727uvis_seq 444,923 676,724 446,381 683,083
Bank Loan 623 6,175,111 9,850,386 6,175,111 9,850,386HD_sex 6,132,041 9,726,635 6,118,279 9,700,285HD_seq 6,108,118 9,742,051 6,163,441 9,800,157
uvis_sex 6,294,591 10,090,469 6,314,019 10,149,870uvis_seq 6,350,864 10,055,060 6,311,301 10,046,143
m=5 m=10
1,901
670
Debt (continue)