A statistical approach to surrogate data
Li-Chun Zhang
Statistics Norway
E-mail: [email protected]
A setting of surrogate data
• Target data
  – Directly collected, such as in sample surveys
  – For a subset of the population, if available
• Surrogate data
  – To replace target data for statistical purposes, hence “surrogate”
      For reasons such as cost, burden, scope, etc.
  – Secondary in nature: re-use of data collected for other purposes
  – Typically from multiple sources
  – Often for the entire population or a major part of it
  – Two examples:
      Register-based employment statistics (surrogate) & LFS (target)
      Self-administered census (surrogate) & post-census survey (target)
• Issues of concern:
  – Conditions for valid substitution
  – Associated statistical accuracy
Unit-specific approach: Equality vs. equivalence
• Unit-specific approach
  – Scheme
      Link surrogate and target data at the micro level
      Estimation of relevant unit-specific misclassification rates
      Propagation of uncertainty to statistics of interest
  – Two shortcomings
      Requires micro-level linkage
      Unit-specific consistency may be irrelevant or misleading for some uses
• Equality vs. equivalence: an example
  – Two binary data sets of the same size
  – Equal mean without the values being equal for all the units
  – Identical empirical CDF => identical statistical inference
  – Unit-level inequality may fail to reveal statistical equivalence
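The equality vs. equivalence example can be made concrete with a minimal sketch. The two binary arrays below are hypothetical values invented for illustration, and `empirical_cdf` is a helper name introduced here, not something from the slides.

```python
import numpy as np

# Hypothetical binary data for the same six units (values are made up).
y = np.array([1, 0, 1, 0, 1, 0])  # target data
z = np.array([0, 1, 1, 0, 0, 1])  # surrogate data


def empirical_cdf(values, grid):
    """Share of values less than or equal to each grid point."""
    values = np.asarray(values)
    return np.array([(values <= g).mean() for g in grid])


grid = [0, 1]
unit_agreement = (y == z).mean()  # share of units whose values coincide
same_mean = y.mean() == z.mean()
same_cdf = np.array_equal(empirical_cdf(y, grid), empirical_cdf(z, grid))

print(unit_agreement, same_mean, same_cdf)
```

Here only a third of the units agree, yet the means and the empirical CDFs coincide, so any statistic computed from the empirical distribution alone is the same for both data sets.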
Other relevant situations
• Some settings
  – Indirect (proxy) interview
  – Unstable reporting
  – Mode effects
  – Public micro data
• Some observations
  – Unit-specific approach may not be applicable
  – Unit-specific equality may not even be desirable
  – To use surrogate data (Z) in place of target data (Y), together with additional data (X)
      Joint distribution of (Z,Y|X) is not of primary interest for users
      Distribution of (Z,X) instead of distribution of (Y,X) is the issue
Validity and equivalence
• Valid surrogate data
  – Denote by f(x,y) and f(x,z) the distribution functions of (X,Y) and (X,Z); Z is valid if, for all (x,y):
      f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
  – Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to the ILO definition
  – Example: Z = proxy interview in LFS, Y = direct interview in LFS
  – Equality of distribution can be assessed without linked / linkable data
• Empirically equivalent surrogate data
  – Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions; Z is empirically equivalent if, for all (x,y):
      p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
  – Equality on micro level not necessary & s1 may differ from s2
  – Parametric analogy: statistical equivalence by the Sufficiency Principle
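The empirical equivalence condition compares p(x,z; s1) with p(x,y; s2) cell by cell, which needs no linkage between the two data sets. A minimal sketch, assuming categorical X and hypothetical records; the function names `empirical_joint` and `max_abs_difference` are introduced here for illustration only.

```python
from collections import Counter


def empirical_joint(pairs):
    """Empirical distribution p(x, v) as {(x, v): relative frequency}."""
    counts = Counter(pairs)
    n = sum(counts.values())
    return {cell: c / n for cell, c in counts.items()}


def max_abs_difference(p, q):
    """Largest absolute gap between two empirical distributions."""
    cells = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Hypothetical records: (x, z) from s1 and (x, y) from s2.
# The two sets contain different units and have different sizes.
s1 = [("young", 1), ("young", 0), ("old", 1), ("old", 1)]
s2 = [("old", 1), ("young", 0), ("old", 1), ("young", 1),
      ("young", 1), ("old", 1), ("old", 1), ("young", 0)]

p_xz = empirical_joint(s1)
p_xy = empirical_joint(s2)
gap = max_abs_difference(p_xz, p_xy)

print(gap)
```

A gap of zero means the joint empirical distributions agree in every cell, which is the same as the product form p(X=x) p(. | X=x) agreeing. In practice one would allow for sampling error with a suitable test rather than require an exact match; that choice is outside this sketch.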
Similar ideas in disclosure control literature
• Fienberg et al. (1998)
  – Random generation of “pseudo” micro data Z conditional on {x; s}
  – Parametric f(y | x) or empirical p(y | x)
  – Conditional validity in expectation, provided unbiased estimation
• Rubin (1993)
  – Synthetic data & Bayesian multiple-imputation framework
  – Random generation of population data + sampling
  – No particular emphasis on conditioning & validity instead of equivalence
• SARs
  – Sample of Anonymised Records from census data
  – Real data, albeit anonymised
  – Valid surrogate data
• Micro simulation
  – Based on sample instead of census data
  – Random generation of “imaginary” micro data
  – Validity in expectation provided unbiased estimation of distribution
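Random generation of pseudo micro data from an empirical p(y | x) can be sketched as below. This is an illustrative draw-from-donors scheme under assumed categorical x, not the specific procedure of any of the cited authors; `draw_pseudo_values` and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def draw_pseudo_values(sample, x_values, seed=0):
    """Draw one pseudo value per x from the empirical p(y | x) in `sample`."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for x, y in sample:
        donors[x].append(y)
    # Choosing uniformly among the observed y's in a cell is a draw from
    # the empirical conditional distribution p(y | x; s).
    # An x with no donors in the sample raises IndexError.
    return [(x, rng.choice(donors[x])) for x in x_values]


# Hypothetical sample s of (x, y) and the x-values to generate data for.
sample = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0)]
x_values = ["a", "b", "a", "a", "b"]

pseudo = draw_pseudo_values(sample, x_values)
print(pseudo)
```

The generated values reproduce p(y | x; s) only in expectation over repeated draws, which matches the "validity in expectation" qualification on the slide.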
Some applications / implications?
• Statistics and inference based on surrogate data
  – Validity (or equivalence) vs. efficiency (or accuracy)
  – Example: Employment register (ER) vs. LFS
      Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
      Bias of invalid ER-status vs. variance of valid ER-status
      Balance in trade-off may change direction on more detailed levels
• Micro data for public use
  – Targeting full empirical equivalence followed by disclosure control (DC)
  – Equivalent data targeting coarsened information (embedded DC)
• Micro calibration of surrogate data
  – Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk) (different units in general)
  – Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
  – Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
  – Alternative to statistical matching by Conditional Independence Assumption
  – Uncertainty?
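The bias vs. variance trade-off in the employment register example can be illustrated with a back-of-the-envelope mean squared error comparison. All numbers below are hypothetical, and the sketch assumes simple random sampling without finite population correction and a register bias that is the same at every level of detail.

```python
def mse_biased_register(bias):
    """MSE of a full-coverage but biased proportion: squared bias only."""
    return bias ** 2


def mse_unbiased_sample(p, n):
    """MSE of an unbiased sample proportion: variance p(1 - p) / n."""
    return p * (1 - p) / n


# Hypothetical employment rate and register bias (not from the slides).
p, bias = 0.7, 0.02

# A large n stands for an aggregate level, a small n for a detailed domain.
for n in (10000, 100):
    print(n, mse_biased_register(bias), mse_unbiased_sample(p, n))
```

With these assumed values the unbiased estimate has the smaller MSE at the aggregate level, while the biased deterministic status has the smaller MSE in the detailed domain, which is one way the balance can change direction.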