Variance Estimation When Donor Imputation is Used to Fill in Missing Values Jean-François Beaumont and Cynthia Bocci Statistics Canada Third International

Variance Estimation When Donor Imputation is Used to Fill in Missing Values

Jean-François Beaumont and Cynthia Bocci
Statistics Canada

Third International Conference on Establishment Surveys

Montréal, June 18-21, 2007

2

Overview

Context

Donor imputation

Variance estimation

Simulation study

Conclusion

3

Context

Population parameter to be estimated:

Domain total: $t_{dy} = \sum_{k \in U} d_k y_k$

Estimator in the case of full response: $\hat{t}_{dy} = \sum_{k \in s} w_k d_k y_k$

Calibration estimator

Horvitz-Thompson estimator
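As a minimal sketch of the full-response estimator, the weighted domain total (the sum over the sample of weight times domain variable times y) can be computed as below. The function name, the sample values and the equal weights are illustrative assumptions, and d is taken here to be a 0/1 domain indicator.

```python
import numpy as np

def weighted_domain_total(y, w, d):
    """Sum over the sample of w_k * d_k * y_k."""
    y, w, d = (np.asarray(a, dtype=float) for a in (y, w, d))
    return float(np.sum(w * d * y))

# Hypothetical sample of 4 units, each with weight 10;
# unit 1 is outside the domain of interest (d = 0).
y = [2.0, 5.0, 1.0, 4.0]
w = [10.0, 10.0, 10.0, 10.0]
d = [1, 0, 1, 1]

t_hat = weighted_domain_total(y, w, d)
print(t_hat)  # 70.0
```

With Horvitz-Thompson weights, w would hold the inverse selection probabilities; with a calibration estimator, it would hold the calibrated weights.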

4

Donor Imputation

Imputed estimator: $\hat{t}_{dy}^{I} = \sum_{k \in s} w_k d_k \tilde{y}_k$, with $\tilde{y}_k = y_k$ if $k \in s_r$ and $\tilde{y}_k = y_k^*$ otherwise

With donor imputation, the imputed value is $y_k^* = y_{l(k)}$, for a donor $l(k) \in s_r$

A variety of methods can be considered in order to find a donor $l(k)$ for the recipient $k$

5

Donor Imputation

Two simple examples:

Random hot-deck imputation within classes

Nearest-neighbour imputation

Practical considerations that add some complexity to the imputation process:

Post-imputation edit rules

Hierarchical imputation classes
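The two simple donor-selection rules named on this slide can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, distance is the absolute difference on a single matching variable x, ties go to the first respondent, and every imputation class is assumed to contain at least one respondent. Edit rules and hierarchical classes are not handled.

```python
import numpy as np

def nearest_neighbour_donors(x, responded):
    """Map each nonrespondent k to the respondent l(k) closest on x."""
    x = np.asarray(x, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    pool = np.flatnonzero(responded)           # candidate donors
    recipients = np.flatnonzero(~responded)    # units to impute
    dist = np.abs(x[recipients][:, None] - x[pool][None, :])
    chosen = pool[dist.argmin(axis=1)]
    return dict(zip(recipients.tolist(), chosen.tolist()))

def random_hot_deck_donors(classes, responded, rng):
    """Map each nonrespondent k to a respondent l(k) drawn at random
    (with replacement) from the same imputation class."""
    classes = np.asarray(classes)
    responded = np.asarray(responded, dtype=bool)
    donors = {}
    for k in np.flatnonzero(~responded):
        pool = np.flatnonzero(responded & (classes == classes[k]))
        donors[int(k)] = int(rng.choice(pool))
    return donors

# Hypothetical data: units 1 and 3 did not respond.
x = [1.0, 2.0, 4.0, 7.0, 8.0]
responded = [True, False, True, False, True]
classes = ["a", "a", "b", "b", "b"]

nn = nearest_neighbour_donors(x, responded)
hd = random_hot_deck_donors(classes, responded, np.random.default_rng(0))
print(nn)  # {1: 0, 3: 4}
```

Both functions return the mapping from recipient k to donor l(k), which is all the later variance formulas need from the imputation step.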

6

Imputation Model

Most imputation methods can be justified by an imputation model:

$E_m(y_k \mid s, s_r) = \mu_k$

$V_m(y_k \mid s, s_r) = \sigma_k^2$

$C_m(y_k, y_l \mid s, s_r) = 0$, for $k \neq l$

The donor imputed estimator is assumed to be approximately unbiased under the model:

$E_m(\hat{t}_{dy}^{I} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_m} w_k d_k (\mu_{l(k)} - \mu_k) \approx 0$

7

Current Variance Estimation Methods

Assuming negligible sampling fractions

Chen and Shao (2000, JOS) for NN imputation

Resampling methods

Our method is closely related to:

Rancourt, Särndal and Lee (1994, proc. SRMS): Assumes a ratio model holds

Brick, Kalton and Kim (2004, SM): Condition on the selected donors

8

Imputation Model Approach

Variance decomposition of Särndal (1992, SM):

$E_{mpq}(\hat{t}_{dy}^{I} - t_{dy})^2 = E_m V_p(\hat{t}_{dy}) + E_{pq} E_m(\hat{t}_{dy}^{I} - \hat{t}_{dy})^2 + V_{MIX} = V_{SAM} + V_{NR} + V_{MIX}$

For any donor imputation method, we have:

$\hat{t}_{dy}^{I} = \sum_{k \in s_r} W_{dk} y_k$, where $W_{dk} = w_k d_k + \sum_{i \in s_m} w_i d_i I\{l(i) = k\}$
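The identity on this slide says that the imputed estimator is a weighted sum over respondents only, where each respondent's weight is its own w times d plus the w times d of every recipient it donates to. A small numerical check, with hypothetical weights, values and donor assignments, might look like this:

```python
import numpy as np

def donor_weights(w, d, responded, donor_of):
    """W_dk = w_k d_k + sum of w_i d_i over recipients i with l(i) = k.
    Entries for nonrespondents are set to zero."""
    w = np.asarray(w, dtype=float)
    d = np.asarray(d, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    W = np.where(responded, w * d, 0.0)
    for i, k in donor_of.items():
        W[k] += w[i] * d[i]
    return W

# Hypothetical sample: units 1 and 3 are nonrespondents.
w = np.array([10.0, 20.0, 10.0, 5.0, 10.0])
d = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
y = np.array([3.0, np.nan, 5.0, np.nan, 9.0])
responded = ~np.isnan(y)
donor_of = {1: 0, 3: 4}  # recipient -> donor, assumed given

W = donor_weights(w, d, responded, donor_of)

# Imputed estimator computed directly from the completed data set
y_imputed = y.copy()
for i, k in donor_of.items():
    y_imputed[i] = y[k]
t_from_imputed_data = float(np.sum(w * d * y_imputed))

# Same estimator written as a weighted sum over respondents
t_from_donor_weights = float(np.sum(W[responded] * y[responded]))

print(t_from_imputed_data, t_from_donor_weights)  # 225.0 225.0
```

Writing the estimator this way is convenient because, conditional on the selected donors, it is linear in the respondent values.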

9

Estimation of the nonresponse variance

The estimation of the nonresponse variance is achieved by estimating

$E_m\{(\hat{t}_{dy}^{I} - \hat{t}_{dy})^2 \mid s, s_r\} \approx V_m(\hat{t}_{dy}^{I} - \hat{t}_{dy} \mid s, s_r)$

Noting that the nonresponse error is:

$\hat{t}_{dy}^{I} - \hat{t}_{dy} = \sum_{k \in s_r} (W_{dk} - w_k d_k) y_k - \sum_{k \in s_m} w_k d_k y_k$

Then, the nonresponse variance estimator is:

$\hat{V}_m(\hat{t}_{dy}^{I} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_r} (W_{dk} - w_k d_k)^2 \hat{\sigma}_k^2 + \sum_{k \in s_m} (w_k d_k)^2 \hat{\sigma}_k^2$
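Because the nonresponse error is a linear combination of uncorrelated y-values under the model, its model variance is a sum of squared coefficients times the model variances. The sketch below evaluates that sum for hypothetical inputs; the donor weights W and the variance estimates are simply assumed to be available, and the function name is illustrative.

```python
import numpy as np

def nonresponse_variance(w, d, W, sigma2_hat, responded):
    """Sum over respondents of (W_dk - w_k d_k)^2 * sigma2_k
    plus sum over nonrespondents of (w_k d_k)^2 * sigma2_k."""
    wd = w * d
    resp_part = np.sum((W[responded] - wd[responded]) ** 2 * sigma2_hat[responded])
    miss_part = np.sum(wd[~responded] ** 2 * sigma2_hat[~responded])
    return float(resp_part + miss_part)

# Hypothetical inputs: units 1 and 3 are nonrespondents,
# unit 0 donates to unit 1 and unit 4 donates to unit 3.
w = np.array([10.0, 20.0, 10.0, 5.0, 10.0])
d = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
responded = np.array([True, False, True, False, True])
W = np.array([30.0, 0.0, 0.0, 0.0, 15.0])   # donor weights, assumed given
sigma2_hat = np.full(5, 2.0)                 # assumed variance estimates

v_nr = nonresponse_variance(w, d, W, sigma2_hat, responded)
print(v_nr)  # 1700.0
```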

10

Estimation of the mixed component

Similarly, the estimation of the mixed component is achieved by estimating

$2\,\mathrm{Cov}_m(\hat{t}_{dy}^{I} - \hat{t}_{dy}, \hat{t}_{dy} - t_{dy} \mid s, s_r)$

The mixed component estimator is:

$\hat{V}_{MIX} = 2 \sum_{k \in s_r} (w_k - 1)(W_{dk} - w_k d_k) d_k \hat{\sigma}_k^2 - 2 \sum_{k \in s_m} (w_k - 1) w_k d_k \hat{\sigma}_k^2$

This component can be either positive or negative and may not always be negligible
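To illustrate that the mixed component can take either sign, the sketch below evaluates the estimator (twice the respondent sum of (w - 1)(W - w d) d sigma2, minus twice the nonrespondent sum of (w - 1) w d sigma2) on hypothetical inputs. The function name, the donor weights and the variance estimates are assumptions made for the example.

```python
import numpy as np

def mixed_component(w, d, W, sigma2_hat, responded):
    """Estimator of the mixed component of the variance decomposition."""
    resp, miss = responded, ~responded
    resp_part = np.sum((w[resp] - 1.0) * (W[resp] - w[resp] * d[resp])
                       * d[resp] * sigma2_hat[resp])
    miss_part = np.sum((w[miss] - 1.0) * w[miss] * d[miss] * sigma2_hat[miss])
    return float(2.0 * resp_part - 2.0 * miss_part)

# Hypothetical inputs: units 1 and 3 are nonrespondents,
# unit 0 donates to unit 1 and unit 4 donates to unit 3.
w = np.array([10.0, 20.0, 10.0, 5.0, 10.0])
d = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
responded = np.array([True, False, True, False, True])
W = np.array([30.0, 0.0, 0.0, 0.0, 15.0])   # donor weights, assumed given
sigma2_hat = np.full(5, 2.0)                 # assumed variance estimates

v_mix = mixed_component(w, d, W, sigma2_hat, responded)
print(v_mix)  # -700.0
```

In this example the value is negative because a heavily weighted nonrespondent receives its value from a donor with a smaller weight.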

11

Estimation of the sampling variance

Let $v(y) = \hat{V}_p(\hat{\theta})$ be the full response variance estimator

The strategy consists of

Estimating $E_m\{v(y) \mid s, s_r, Y_r\}$

Replacing the unknown $\mu_k$ and $\sigma_k^2$ by their estimates

This leads to the sampling variance estimator:

$v(\hat{y}) + \sum_{k \in s_m} (1 - \pi_k)(w_k d_k)^2 \hat{\sigma}_k^2$

where $\hat{y}$ denotes the sample values with each missing $y_k$ replaced by $\hat{\mu}_k$, and $\pi_k$ is the selection probability of unit $k$

12

Estimation of the sampling variance

This strategy is essentially equivalent to

Randomly imputing the missing values using the imputation model

Computing the full response sampling variance estimator by treating these imputed values as true values

Repeating this process a large number of times and taking the average of the sampling variance estimates

Similar to multiple imputation sampling variance estimator
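The equivalence described on this slide can be checked numerically. The sketch below is not the authors' implementation: it assumes a Poisson-sampling form of the full response variance estimator (so that v is a sum of squares with selection probability 1/w), normal draws from the imputation model, and hypothetical inputs. It compares the average of v over many random imputations with the closed form obtained by plugging the estimated means into v and adding a correction term over the nonrespondents.

```python
import numpy as np

def v_poisson(y, w, d):
    """Assumed full response variance estimator (Poisson sampling):
    sum of (1 - pi_k) * (w_k d_k y_k)^2 with pi_k = 1 / w_k."""
    return float(np.sum((1.0 - 1.0 / w) * (w * d * y) ** 2))

def sampling_variance_closed_form(y, w, d, responded, mu_hat, sigma2_hat):
    """v evaluated at the mean-filled data plus the variance correction."""
    miss = ~responded
    y_fill = np.where(responded, y, mu_hat)
    correction = np.sum((1.0 - 1.0 / w[miss]) * (w[miss] * d[miss]) ** 2
                        * sigma2_hat[miss])
    return v_poisson(y_fill, w, d) + float(correction)

def sampling_variance_monte_carlo(y, w, d, responded, mu_hat, sigma2_hat, reps, rng):
    """Average of v over repeated random imputations from the model."""
    miss = ~responded
    y_work = np.where(responded, y, 0.0)
    total = 0.0
    for _ in range(reps):
        # Normal draws are an assumption; only the first two moments matter here.
        y_work[miss] = rng.normal(mu_hat[miss], np.sqrt(sigma2_hat[miss]))
        total += v_poisson(y_work, w, d)
    return total / reps

# Hypothetical inputs: units 1 and 3 are nonrespondents.
w = np.array([10.0, 20.0, 10.0, 5.0, 10.0])
d = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
y = np.array([3.0, np.nan, 5.0, np.nan, 9.0])
responded = ~np.isnan(y)
mu_hat = np.array([3.0, 4.0, 5.0, 8.0, 9.0])
sigma2_hat = np.full(5, 2.0)

closed = sampling_variance_closed_form(y, w, d, responded, mu_hat, sigma2_hat)
mc = sampling_variance_monte_carlo(y, w, d, responded, mu_hat, sigma2_hat,
                                   reps=20000, rng=np.random.default_rng(1))
print(closed, mc)
```

The two numbers agree up to Monte Carlo error, which is why the closed form can replace the repeated random imputation.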

13


Generated a population of size 1000

Two y-variables:

LIN: Linear relationship between y and x

NLIN: Nonlinear relationship between y and x

Two different sample sizes:

Small sampling fraction: n=50

Large sampling fraction: n=500

Response probability depends on x with an average of 0.5

14


Imputation: Nearest-Neighbour imputation using x as the matching variable

Estimation of $\mu_k$ and $\sigma_k^2$:

LIN: Linear model in perfect agreement with the LIN y-variable

NPAR: Nonparametric estimation using the procedure TPSPLINE of SAS
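For the LIN option, one simple way to obtain estimates of the model mean and variance is an ordinary least squares fit on the respondents. The sketch below assumes a straight-line mean in x and a constant error variance; the slides do not state the exact linear model used, so these choices and the function name are illustrative. The nonparametric option (a smoothing spline in SAS) is not reproduced here.

```python
import numpy as np

def fit_linear_moments(x, y, responded):
    """OLS of y on (1, x) among respondents.
    Returns mu_hat for every sampled unit and a constant sigma2_hat
    (homoscedastic errors are assumed)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[responded], y[responded], rcond=None)
    mu_hat = X @ beta
    resid = y[responded] - mu_hat[responded]
    dof = int(responded.sum()) - X.shape[1]
    sigma2_hat = np.full(x.shape, float(resid @ resid) / dof)
    return mu_hat, sigma2_hat

# Hypothetical data lying exactly on y = 1 + 2x; units 1 and 4 are missing.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, np.nan, 5.0, 7.0, np.nan])
responded = ~np.isnan(y)

mu_hat, sigma2_hat = fit_linear_moments(x, y, responded)
print(np.round(mu_hat, 6))
```

The returned arrays can be passed directly to the nonresponse, mixed and sampling variance estimators.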

15


Two objectives:

Compare the two ways of estimating $\mu_k$ and $\sigma_k^2$: LIN and NPAR

Compare three nonparametric methods:

NPAR

NPAR_Naïve: NPAR with the sampling variance being estimated by the naïve sampling variance, that is, $v(\cdot)$ applied to the imputed data (Brick, Kalton and Kim, 2004)

CS: method of Chen and Shao (2000)

16

Results: Large sampling fraction

Method    Relative Bias in %      RRMSE in %
          y-LIN     y-NLIN        y-LIN     y-NLIN
LIN        -2.4      358.4         15.7      514.1
NPAR       -0.3      -18.8         21.5       54.6
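The slides do not define the two performance measures. Assuming the usual Monte Carlo definitions (relative bias as the average error of the variance estimates divided by the true variance, and RRMSE as the root mean squared error divided by the true variance, both in percent), they could be computed as in this sketch. The function names and the numbers are hypothetical and unrelated to the table above.

```python
import numpy as np

def relative_bias_pct(estimates, true_value):
    """100 * (mean of estimates - true value) / true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(100.0 * (estimates.mean() - true_value) / true_value)

def rrmse_pct(estimates, true_value):
    """100 * sqrt(mean squared error) / true value."""
    estimates = np.asarray(estimates, dtype=float)
    mse = np.mean((estimates - true_value) ** 2)
    return float(100.0 * np.sqrt(mse) / true_value)

# Hypothetical variance estimates from 4 simulated samples, true variance 100
estimates = [90.0, 110.0, 100.0, 120.0]
rb = relative_bias_pct(estimates, 100.0)
rr = rrmse_pct(estimates, 100.0)
print(rb)            # 5.0
print(round(rr, 2))  # 12.25
```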

17

Results: Small sampling fraction

Method        Relative Bias in %      RRMSE in %
              y-LIN     y-NLIN        y-LIN     y-NLIN
NPAR           -4.9      -13.3         41.8      245.4
NPAR_Naïve     -5.9      -10.4         42.1      265.8
CS             -9.1       -9.4         52.8      257.8

18

Results: Large sampling fraction

Method        Relative Bias in %      RRMSE in %
              y-LIN     y-NLIN        y-LIN     y-NLIN
NPAR           -0.3      -18.8         21.5       54.6
NPAR_Naïve     -0.3      -12.0         21.8       69.1
CS             33.9       59.6         53.7      118.7

19


Nonparametric estimation of $\mu_k$ and $\sigma_k^2$ seems beneficial (robust) with Nearest-Neighbour imputation

Our proposed method is valid even for large sampling fractions

It seems to be slightly better to use our sampling variance estimator instead of the naïve sampling variance estimator

20


Work done in the context of developing a variance estimation system (SEVANI)

Methodology implemented in the next version 2.0 of SEVANI

Estimation of $\mu_k$ and $\sigma_k^2$:

Linear model

Nonparametric estimation

21

Thanks - Merci

For more information, please contact

Jean-François Beaumont, [email protected]

Cynthia Bocci, Cynthia.Bocci@statcan.ca
